Abstract

In binary classification problems, two main approaches have been proposed: the loss function
approach and the uncertainty set approach. The loss function approach is applied to major
learning algorithms such as support vector machine (SVM) and boosting methods. The loss
function represents the penalty of the decision function on the training samples. In the
learning algorithm, the empirical mean of the loss function is minimized to obtain the
classifier. Against the backdrop of developments in mathematical programming, learning
algorithms based on loss functions are now widely applied to real-world data analysis. In
addition, the statistical properties of such learning algorithms are well understood thanks to
a large body of theoretical work. On the other hand, the learning method using the so-called
uncertainty set is used in hard-margin SVM, mini-max probability machine (MPM) and maximum
margin MPM. In this learning algorithm, the uncertainty set is first defined for each binary
label based on the training samples. Then, the best separating hyperplane between the two
uncertainty sets is employed as the decision function. This is regarded as an extension of the
maximum-margin approach. The uncertainty set approach has been studied as an application of
robust optimization in the field of mathematical programming. The statistical properties of
learning algorithms with uncertainty sets have not been intensively studied. In this paper, we
consider the relation between the above two approaches. We point out that the uncertainty set
is described by using the level set of the conjugate of the loss function. Based on this
relation, we study statistical properties of learning algorithms using uncertainty sets.



1 Introduction 

In classification problems, the goal is to predict output labels for given input vectors. For this 
purpose, a decision function defined on the input space is estimated from training samples. The 
output value of the decision function is used for the label prediction. In binary classification 
problems, the label is predicted by the sign of the decision function. 

Many learning algorithms use loss functions to measure the penalty of misclassifications.
The decision function minimizing the empirical mean of the loss function over training samples
is employed as the estimator [8, 24, 12, 14]. For example, hinge loss, exponential loss and logistic
loss are used for support vector machine (SVM), Adaboost and logistic regression, respectively.
Especially in binary classification tasks, statistical properties of learning algorithms based
on loss functions are well-understood due to intensive recent works. See [3. bd. 25 . [i^ . [sol . [2^ 
for details. 

As another approach, the maximum-margin criterion is also applied to statistical learning.
Under the maximum-margin criterion, the best separating hyperplane between the two
output labels is employed as the decision function. In hard-margin SVM [29], a convex-hull
of input vectors for each binary label is defined, and the maximum margin between the two
convex-hulls is considered. For the non-separable case, $\nu$-SVM provides a similar picture [24, 5].
In $\nu$-SVM, the so-called reduced convex-hull, which is a subset of the original convex-hull, is
used for learning. A reduced convex-hull is defined for each label, and the best separating
hyperplane between the two reduced convex-hulls is employed as the decision function. Not
only polyhedral sets such as the convex-hull of finite input points but also ellipsoidal sets are
applied to classification problems [15, 18]. In this paper, the set used in the maximum-margin
criterion is referred to as the uncertainty set. This term is borrowed from robust optimization in
mathematical programming [3].

There are some works in which the statistical properties of learning based on the
uncertainty set are studied. For example, [15] proposed the minimax probability machine (MPM) using
ellipsoidal uncertainty sets, and studied statistical properties under the worst-case setting.
In statistical learning using uncertainty sets, the main concern has been to develop optimization
algorithms under the maximum margin criterion [17]. So far, statistical properties of learning
algorithms using uncertainty sets have not been studied as intensively as those of learning
algorithms using loss functions.

The main purpose of this paper is to study the learning algorithm using the uncertainty
set. We focus on the relation between the loss function and the uncertainty set. We show
that the uncertainty set is described by using the conjugate function of the loss function. For
a given uncertainty set, we construct the corresponding loss function. We study the statistical
properties of the learning algorithm using the uncertainty set by applying theoretical results on
the loss function approach. Then, we establish the statistical consistency of learning algorithms
using the uncertainty set. We point out that, in general, the maximum margin criterion for
a fixed uncertainty set does not provide accurate decision functions. We need to introduce an
uncertainty set parametrized by a one-dimensional parameter which specifies the size of the
uncertainty set. We show that a modified maximum margin criterion with the parametrized
uncertainty set recovers the statistical consistency.

The paper is organized as follows. In Section 2, we introduce the existing method based
on the uncertainty set. In Section 3, we investigate the relation between loss functions and
uncertainty sets. Section 4 is devoted to illustrating a way of revising the uncertainty set to
recover nice statistical properties. In Section 5, we present a kernel-based learning algorithm
with uncertainty sets. In Section 6, we prove that the proposed algorithm has the statistical
consistency. Numerical experiments are shown in Section 7. We conclude in Section 8. Some
proofs are shown in the Appendix.

We summarize some notations to be used throughout the paper. The indicator function
is denoted as $[\![A]\!]$, i.e., $[\![A]\!]$ equals 1 if $A$ is true, and 0 otherwise. The column vector $x$ in
the Euclidean space is described in bold face. The transposition of $x$ is denoted as $x^\top$. The
Euclidean norm of the vector $x$ is expressed as $\|x\|$. For a set $S$ in a linear space, the convex-hull
of $S$ is denoted as $\mathrm{conv}\,S$ or $\mathrm{conv}(S)$. The number of elements in the set $S$ is denoted as $|S|$.
The expectation of the random variable $Z$ w.r.t. the probability distribution $P$ is described as
$\mathbb{E}_P[Z]$. We will drop the subscript $P$ and write $\mathbb{E}[Z]$ when it is clear from the context. The set of all
measurable functions on the set $\mathcal{X}$ is denoted by $L_0(\mathcal{X})$, or $L_0$ for short. The supremum norm
of $f \in L_0$ is denoted as $\|f\|_\infty$. For the reproducing kernel Hilbert space $\mathcal{H}$, $\|f\|_{\mathcal{H}}$ is the norm of
$f \in \mathcal{H}$ defined from the inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ on $\mathcal{H}$.

2 Preliminaries 

We define $\mathcal{X}$ as the input space and $\{+1,-1\}$ as the set of binary labels. Suppose that the
training samples $(x_1,y_1),\ldots,(x_m,y_m) \in \mathcal{X}\times\{+1,-1\}$ are drawn i.i.d. according to a probability
distribution $P$ on $\mathcal{X}\times\{+1,-1\}$. The goal is to estimate a decision function $f:\mathcal{X}\to\mathbb{R}$ from a
set of functions $\mathcal{F}$, such that the sign of $f(x)$ provides an accurate prediction of the unknown
binary label associated with the input $x$ under the probability distribution $P$. In other words, for
the estimated decision function $f$, the probability of $\mathrm{sign}(f(x)) \neq y$ is expected to be as small as
possible. In this article, the composite function of the sign function and the decision function,
$\mathrm{sign}(f(x))$, is referred to as the classifier.

2.1 Learning with loss functions 

In binary classification problems, the prediction accuracy of the decision function $f$ is measured
by the 0-1 loss $[\![ y f(x) \le 0 ]\!]$, which equals 1 when the sign of $f(x)$ is different from $y$ and
0 otherwise. The average prediction performance of the decision function $f$ is evaluated by the
expected 0-1 loss, i.e.,

$$\mathcal{E}(f) = \mathbb{E}\big[\,[\![ y f(x) \le 0 ]\!]\,\big]. \tag{1}$$

The Bayes risk $\mathcal{E}^*$ is defined as the minimum value of the expected 0-1 loss over all the measurable
functions on $\mathcal{X}$,

$$\mathcal{E}^* = \inf\{\mathcal{E}(f) : f \in L_0\}. \tag{2}$$

The Bayes risk is the lowest achievable error rate under the probability $P$. Given the set of training
samples, $T = \{(x_1,y_1),\ldots,(x_m,y_m)\}$, the empirical 0-1 loss is denoted by

$$\hat{\mathcal{E}}_T(f) = \frac{1}{m}\sum_{i=1}^{m} [\![ y_i f(x_i) \le 0 ]\!]. \tag{3}$$

The subscript $T$ in $\hat{\mathcal{E}}_T(f)$ is dropped if it is clear from the context.

In general, minimization of $\hat{\mathcal{E}}_T(f)$ is considered a hard problem [1]. The main difficulty
comes from the non-convexity of the 0-1 loss $[\![ y f(x) \le 0 ]\!]$ as a function of $f$. Hence,
many learning algorithms use a surrogate loss of the 0-1 loss in order to make the computation
tractable. For example, SVM uses the hinge loss, $\max\{1 - y f(x), 0\}$, and Adaboost uses the
exponential loss, $\exp\{-y f(x)\}$. Both the hinge loss and the exponential loss are convex in $f$,
and they provide an upper bound of the 0-1 loss. Thus, the minimizer under the surrogate loss
is also expected to minimize the 0-1 loss. The quantitative relation between the 0-1 loss and the
surrogate loss was studied by Q]. 
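
The upper-bound property of the surrogate losses can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names and the sample margins are hypothetical, and counting a zero margin $y f(x) = 0$ as an error is an assumption about the convention.

```python
import math

def zero_one_loss(margin):
    """0-1 loss as a function of the margin y * f(x); a zero margin counts as an error (assumption)."""
    return 1.0 if margin <= 0 else 0.0

def hinge_loss(margin):
    """Hinge loss max{1 - y f(x), 0}, the surrogate used by SVM."""
    return max(1.0 - margin, 0.0)

def exponential_loss(margin):
    """Exponential loss exp{-y f(x)}, the surrogate used by Adaboost."""
    return math.exp(-margin)

def empirical_mean(loss, margins):
    """Empirical mean of a loss over the margins y_i * f(x_i) of the training samples."""
    return sum(loss(v) for v in margins) / len(margins)
```

For any list of margins, `empirical_mean(hinge_loss, margins)` and `empirical_mean(exponential_loss, margins)` are at least `empirical_mean(zero_one_loss, margins)`, because the inequality holds pointwise.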

To avoid overfitting of the estimated decision function to training samples, regularization
is considered. By adding a regularization term, such as the squared norm of the decision
function, to the empirical surrogate loss, the complexity of the estimated classifier is restricted. The
balance between the regularization term and the surrogate loss is adjusted by the regularization
parameter [11, 26]. Then, the deviation between the empirical 0-1 loss and the expected 0-1 loss is
controlled by the regularization. When both the regularization term and the surrogate loss are
convex, the computational tractability of the statistical learning is retained.

2.2 Learning with uncertainty sets 

Besides statistical learning using loss functions, there is another approach to the classification 
problems, i.e., statistical learning based on the so-called uncertainty set. We briefly introduce 
the basic idea of the uncertainty set. We assume that $\mathcal{X}$ is a subset of Euclidean space.

In robust optimization problems Q] , the uncertainty set describes uncertainties or ambiguities 
included in optimization problems. The parameter in the optimization problem may not be 
precisely determined. Instead of the precise information, we have an uncertainty set which 
probably includes the parameter in the optimization problem. The worst-case setting is employed 
to solve the robust optimization problem with the uncertainty set. 

The statistical learning with uncertainty sets is considered as an application of robust
optimization to classification problems. In classification problems, the uncertainty set is designed
such that most training samples are included in the uncertainty set with high probability. We
prepare an uncertainty set for each binary label. For example, $\mathcal{U}_p$ and $\mathcal{U}_n$ are the confidence
regions such that the conditional probabilities, $P(x \in \mathcal{U}_p \mid y = +1)$ and $P(x \in \mathcal{U}_n \mid y = -1)$, are
equal to 0.95. As another example, the uncertainty set $\mathcal{U}_p$ (resp. $\mathcal{U}_n$) consists of the convex-hull
of input vectors in training samples having the positive (resp. negative) label. The convex-hull
of data points is used in hard margin SVM [H]. The ellipsoidal uncertainty set is also used for 
the robust classification under the worst-case setting [l^, [3] • 

Based on the uncertainty set, we estimate the linear decision function $f(x) = w^\top x + b$. Here,
we consider the minimum distance problem

$$\min \|x_p - x_n\| \quad \text{subject to } x_p \in \mathcal{U}_p,\ x_n \in \mathcal{U}_n. \tag{4}$$

Let $x_p^*$ and $x_n^*$ be optimal solutions of (4). Then, the normal vector of the decision function, $w$,
is estimated by $c(x_p^* - x_n^*)$, where $c$ is a positive real number. Figure 1 illustrates the estimated
decision boundary. When both $\mathcal{U}_p$ and $\mathcal{U}_n$ are compact subsets satisfying $\mathcal{U}_p \cap \mathcal{U}_n = \emptyset$, the
estimated normal vector cannot be the null vector. The minimum distance problem appears in
the hard margin SVM 0,13, I/-SVM 0,3 and the learning algorithms proposed by |3, [13] • 



In Section 3.1, we briefly introduce the relation between $\nu$-SVM and the minimum distance
problem. In the minimax probability machine (MPM) proposed by [3], a different criterion is applied
to estimate the linear decision function, though the ellipsoidal uncertainty set also plays an important
role in their algorithm.
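
As an illustration of the minimum distance problem (4), the sketch below treats the case in which both uncertainty sets are convex-hulls of finitely many points. It uses Frank-Wolfe iterations with exact line search, which is a solver chosen here for brevity and not the method of the works cited above. The function names, the iteration limit and the tolerance are hypothetical, the normal vector is taken with $c = 1$, and the bias is chosen so that the boundary bisects the segment connecting the two closest points.

```python
def dot(u, v):
    """Euclidean inner product of two vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))

def min_distance_hulls(pos, neg, iters=1000, tol=1e-12):
    """Approximate the closest points of conv(pos) and conv(neg) by Frank-Wolfe iterations."""
    zp, zn = list(pos[0]), list(neg[0])
    for _ in range(iters):
        d = [a - b for a, b in zip(zp, zn)]
        # Linear minimization step: the vertices that reduce <d, x_p - x_n> the most.
        sp = min(pos, key=lambda x: dot(x, d))
        sn = max(neg, key=lambda x: dot(x, d))
        s = [a - b for a, b in zip(sp, sn)]
        gap = dot(d, d) - dot(d, s)  # proportional to the Frank-Wolfe duality gap
        if gap <= tol:
            break
        diff = [a - b for a, b in zip(s, d)]
        step = min(1.0, gap / dot(diff, diff))  # exact line search, clipped to [0, 1]
        zp = [(1 - step) * a + step * b for a, b in zip(zp, sp)]
        zn = [(1 - step) * a + step * b for a, b in zip(zn, sn)]
    return zp, zn

def decision_function_from_closest_points(zp, zn):
    """Normal vector w = zp - zn and the bias of the hyperplane bisecting the segment [zn, zp]."""
    w = [a - b for a, b in zip(zp, zn)]
    b = -(dot(w, zp) + dot(w, zn)) / 2.0
    return w, b
```

For example, with `pos = [(3, 1), (2, 0), (3, -1)]` and `neg = [(-3, 1), (-2, 0), (-3, -1)]`, the closest points are the vertices `(2, 0)` and `(-2, 0)`, so the boundary is the vertical axis.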

The minimum distance problem is equivalent with the maximum margin principle 0, [H]- 
When the bias term $b$ in the linear decision function is estimated such that the decision boundary
bisects the line segment connecting $x_p^*$ and $x_n^*$, the estimated decision boundary achieves the
maximum margin between the uncertainty sets $\mathcal{U}_p$ and $\mathcal{U}_n$. Following these works, we explain how the
maximum margin is connected with the minimum distance. Suppose that $\mathcal{U}_p$ and $\mathcal{U}_n$ are convex
subsets and that $\mathcal{U}_p \cap \mathcal{U}_n = \emptyset$ holds. Then, the margin of the two uncertainty sets along the direction
of $w$ is given as

$$\min\Big\{ \frac{w^\top (x_p - x_n)}{\|w\|} : x_p \in \mathcal{U}_p,\ x_n \in \mathcal{U}_n \Big\}.$$

The maximum margin criterion is described as

$$\max_{w \neq 0}\ \min\Big\{ \frac{w^\top (x_p - x_n)}{\|w\|} : x_p \in \mathcal{U}_p,\ x_n \in \mathcal{U}_n \Big\} = \min\{\|x_p - x_n\| : x_p \in \mathcal{U}_p,\ x_n \in \mathcal{U}_n\}.$$

The equality above follows from the minimum norm duality [16].

Figure 1: The estimated decision boundary based on the minimum distance problem with the
uncertainty sets $\mathcal{U}_p$ and $\mathcal{U}_n$.

3 Relation between Loss Functions and Uncertainty Sets 

We study the relation between loss functions and uncertainty sets. First, we introduce the
relation in $\nu$-SVM, following existing work. Then, we present an extension of $\nu$-SVM to investigate
a generalized relation between loss functions and uncertainty sets.

3.1 Uncertainty Set in $\nu$-SVM

Suppose that the input space $\mathcal{X}$ is a subset of Euclidean space $\mathbb{R}^d$. We consider the linear decision
function, $f(x) = w^\top x + b$, where the normal vector $w \in \mathbb{R}^d$ and the bias term $b \in \mathbb{R}$ are to be
estimated based on observed training samples. By applying the kernel trick @, H^, we obtain 
rich statistical models for the decision function, while keeping the computational tractability. 
In $\nu$-SVM, the classifier is estimated as the optimal solution of

$$\min_{w,b,\rho}\ \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m} \max\{\rho - y_i(w^\top x_i + b),\, 0\}, \quad w \in \mathbb{R}^d,\ b \in \mathbb{R},\ \rho \in \mathbb{R}, \tag{5}$$

where $\nu \in (0, 1)$ is a prespecified constant which has the role of the regularization parameter.
As [24] pointed out, the parameter $\nu$ controls the margin errors and the number of support vectors.
In $\nu$-SVM, a variant of the hinge loss, $\max\{\rho - y_i(w^\top x_i + b), 0\}$, is used as the surrogate loss.
In the original formulation of $\nu$-SVM, the non-negativity constraint, $\rho \ge 0$, is introduced. As
shown by [9], we can confirm that the non-negativity constraint is redundant. Indeed, for an
optimal solution $\hat{w}, \hat{b}, \hat{\rho}$, we have

$$-\nu\hat{\rho} \le \frac{1}{2}\|\hat{w}\|^2 - \nu\hat{\rho} + \frac{1}{m}\sum_{i=1}^{m} \max\{\hat{\rho} - y_i(\hat{w}^\top x_i + \hat{b}),\, 0\} \le 0,$$

where the last inequality comes from the fact that the parameter $w = 0$, $b = 0$, $\rho = 0$ is a
feasible solution of (5). As a result, we have $\hat{\rho} \ge 0$ for $\nu > 0$.

We briefly show that the dual problem of (5) yields the minimum distance problem in which
the reduced convex-hulls of training samples are used as uncertainty sets. See jH] for details. 
The problem (5) is equivalent to

$$\min_{w,b,\rho,\xi}\ \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m}\xi_i \quad \text{subject to } \xi_i \ge 0,\ \xi_i \ge \rho - y_i(w^\top x_i + b),\ i = 1,\ldots,m.$$

Then, the Lagrangian function is defined as

$$L(w,b,\rho,\xi,\alpha,\beta) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\big(\rho - y_i(w^\top x_i + b) - \xi_i\big) - \sum_{i=1}^{m}\beta_i\xi_i,$$

where $\alpha_i, \beta_i$, $i = 1,\ldots,m$, are non-negative Lagrange multipliers. For the observed training
samples, we define $M_p$ and $M_n$ as the sets of sample indices for each label, i.e.,

$$M_p = \{i \mid y_i = +1\}, \quad M_n = \{i \mid y_i = -1\}. \tag{6}$$

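By applying the min-max theorem, we have

$$\begin{aligned}
&\inf_{w,b,\rho,\xi}\ \sup_{\alpha \ge 0,\, \beta \ge 0} L(w,b,\rho,\xi,\alpha,\beta) \\
&= \sup_{\alpha \ge 0,\, \beta \ge 0}\ \inf_{w,b,\rho,\xi} L(w,b,\rho,\xi,\alpha,\beta) \\
&= \sup\Big\{ -\frac{1}{2}\Big\|\sum_{i=1}^{m}\alpha_i y_i x_i\Big\|^2 : \sum_{i=1}^{m}\alpha_i = \nu,\ \sum_{i=1}^{m}\alpha_i y_i = 0,\ 0 \le \alpha_i \le \frac{1}{m} \Big\} \\
&= -\frac{\nu^2}{8}\inf\Big\{ \Big\|\sum_{i \in M_p}\gamma_i x_i - \sum_{j \in M_n}\gamma_j x_j\Big\|^2 : \sum_{i \in M_p}\gamma_i = \sum_{j \in M_n}\gamma_j = 1,\ 0 \le \gamma_i \le \frac{2}{m\nu},\ i = 1,\ldots,m \Big\},
\end{aligned} \tag{7}$$

where the last equality is obtained by changing the variable from $\alpha_i$ to $\gamma_i = 2\alpha_i/\nu$. For the
positive (resp. negative) label, we introduce the uncertainty set $\mathcal{U}_p$ (resp. $\mathcal{U}_n$) defined by the
reduced convex-hull, i.e.,

$$\mathcal{U}_o = \Big\{ \sum_{i \in M_o}\gamma_i x_i : \sum_{i \in M_o}\gamma_i = 1,\ 0 \le \gamma_i \le \frac{2}{m\nu},\ i \in M_o \Big\}, \quad o \in \{p, n\}.$$

When the upper limit of $\gamma_i$ is less than one, the reduced convex-hull is a subset of the convex-hull
of training samples. We find that solving the problem (7) is identical to solving the minimum
distance problem under the uncertainty set of the reduced convex-hulls,

$$\inf \|x_p - x_n\| \quad \text{subject to } x_p \in \mathcal{U}_p,\ x_n \in \mathcal{U}_n.$$

The representation based on the minimum distance problem provides an intuitive understanding
of the learning algorithm.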
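
To give a feeling for how the cap $2/(m\nu)$ on the weights shrinks the convex-hull, the sketch below computes the point of a reduced convex-hull that lies farthest along a given direction. The greedy construction and all names are illustrative assumptions, not part of the original derivation; `m` denotes the total sample size and `points` the inputs of one label.

```python
def dot(u, v):
    """Euclidean inner product of two vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))

def reduced_hull_extreme_point(points, nu, m, direction):
    """Maximize <direction, z> over the reduced convex-hull with weights capped at 2 / (m * nu)."""
    cap = 2.0 / (m * nu)
    if cap * len(points) < 1.0:
        raise ValueError("the reduced convex-hull is empty for this value of nu")
    # Greedy rule: put as much weight as allowed on the points farthest along the direction.
    ordered = sorted(points, key=lambda x: dot(x, direction), reverse=True)
    remaining = 1.0
    z = [0.0] * len(direction)
    for x in ordered:
        weight = min(cap, remaining)
        z = [a + weight * b for a, b in zip(z, x)]
        remaining -= weight
        if remaining <= 0.0:
            break
    return z
```

With three positive points `(0, 0)`, `(2, 0)`, `(4, 0)` out of `m = 8` samples, `nu = 0.5` caps each weight at 0.5, so the farthest point along `(1, 0)` is the midpoint of the two rightmost inputs. A smaller `nu = 0.25` lifts the cap to 1 and recovers the vertex `(4, 0)` of the full convex-hull.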



3.2 Uncertainty Set Associated with Loss Function 

We consider general loss functions, and study the relation between the loss function and the
corresponding uncertainty set. Again, the decision function is defined as $f(x) = w^\top x + b$
on $\mathbb{R}^d$. Let $\ell : \mathbb{R} \to \mathbb{R}$ be a convex and non-decreasing function. For the training samples,
$(x_1,y_1),\ldots,(x_m,y_m)$, we propose a learning method in which the decision function is estimated
by solving

$$\inf_{w,b,\rho}\ -2\rho + \frac{1}{m}\sum_{i=1}^{m}\ell\big(\rho - y_i(w^\top x_i + b)\big) \quad \text{subject to } \|w\|^2 \le \lambda^2,\ b \in \mathbb{R},\ \rho \in \mathbb{R}. \tag{8}$$

The regularization effect is introduced by the constraint $\|w\|^2 \le \lambda^2$, where $\lambda$ is the regularization
parameter which may depend on the sample size.

The statistical learning using (8) is regarded as an extension of $\nu$-SVM. To see this, we
define $\ell(z) = \max\{2z/\nu, 0\}$. Let $\hat{w}, \hat{b}, \hat{\rho}$ be an optimal solution of (5) for a fixed $\nu \in (0, 1)$. By
comparing the optimality conditions of (5) and (8), we can confirm that the problem (8) with
$\lambda = \|\hat{w}\|$ has the same optimal solution as $\nu$-SVM.

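In a similar way to $\nu$-SVM, we derive the uncertainty set associated with the loss function
$\ell$ in (8). We introduce the slack variables $\xi_i$, $i = 1,\ldots,m$, satisfying the inequalities $\xi_i \ge
\rho - y_i(w^\top x_i + b)$, $i = 1,\ldots,m$. Then, the Lagrangian function of (8) is given as

$$L(w,b,\rho,\xi,\alpha,\mu) = -2\rho + \frac{1}{m}\sum_{i=1}^{m}\ell(\xi_i) + \sum_{i=1}^{m}\alpha_i\big(\rho - y_i(w^\top x_i + b) - \xi_i\big) + \mu(\|w\|^2 - \lambda^2),$$

where $\alpha_1,\ldots,\alpha_m$ and $\mu$ are the non-negative Lagrange multipliers. The optimality conditions,

$$\frac{\partial L}{\partial \rho} = 0 \quad \text{and} \quad \frac{\partial L}{\partial b} = 0,$$

and the non-negativity of $\alpha_i$ lead to the constraint on Lagrange multipliers,

$$\sum_{i \in M_p}\alpha_i = \sum_{i \in M_n}\alpha_i = 1, \quad \alpha_i \ge 0.$$

We define the conjugate function of $\ell(z)$ as

$$\ell^*(\alpha) = \sup_{z \in \mathbb{R}}\{\alpha z - \ell(z)\}.$$

Then, by applying the min-max theorem, we have

$$\begin{aligned}
&\inf_{w,b,\rho,\xi}\ \sup_{\alpha \ge 0,\, \mu \ge 0} L(w,b,\rho,\xi,\alpha,\mu) \\
&= \sup_{\alpha \ge 0,\, \mu \ge 0}\ \inf_{w,b,\rho,\xi} L(w,b,\rho,\xi,\alpha,\mu) \\
&= \sup_{\alpha,\, \mu \ge 0}\ \inf_{w,\xi}\Big\{ \frac{1}{m}\sum_{i=1}^{m}\big(\ell(\xi_i) - m\alpha_i\xi_i\big) - \sum_{i=1}^{m}\alpha_i y_i x_i^\top w + \mu(\|w\|^2 - \lambda^2) : \sum_{i \in M_p}\alpha_i = \sum_{i \in M_n}\alpha_i = 1,\ \alpha_i \ge 0 \Big\} \\
&= -\inf_{\alpha}\Big\{ \frac{1}{m}\sum_{i=1}^{m}\ell^*(m\alpha_i) + \lambda\Big\|\sum_{i \in M_p}\alpha_i x_i - \sum_{j \in M_n}\alpha_j x_j\Big\| : \sum_{i \in M_p}\alpha_i = \sum_{i \in M_n}\alpha_i = 1,\ \alpha_i \ge 0 \Big\}.
\end{aligned} \tag{9}$$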
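
The last equality in (9) combines two elementary minimizations. The following worked steps are a sketch under the assumptions that $\lambda > 0$ and that the vector $v = \sum_{i=1}^{m}\alpha_i y_i x_i$ is nonzero; the symbol $v$ is introduced here only for brevity.

```latex
% Minimization over each slack variable gives the conjugate of the loss:
\inf_{\xi_i}\Big\{ \frac{1}{m}\ell(\xi_i) - \alpha_i\xi_i \Big\}
  = -\frac{1}{m}\sup_{\xi_i}\big\{ m\alpha_i\xi_i - \ell(\xi_i) \big\}
  = -\frac{1}{m}\ell^*(m\alpha_i).

% Minimization over w for fixed mu > 0 is attained at w = v / (2 mu),
% and the subsequent maximization over mu is attained at mu = ||v|| / (2 lambda):
\sup_{\mu > 0}\ \inf_{w}\big\{ -v^\top w + \mu(\|w\|^2 - \lambda^2) \big\}
  = \sup_{\mu > 0}\Big\{ -\frac{\|v\|^2}{4\mu} - \mu\lambda^2 \Big\}
  = -\lambda\|v\|.
```

Since $y_i = +1$ on $M_p$ and $y_i = -1$ on $M_n$, the vector $v$ equals $\sum_{i \in M_p}\alpha_i x_i - \sum_{j \in M_n}\alpha_j x_j$, which is the difference appearing in the norm in (9).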



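In Section 6, we present a rigorous proof that, under some assumptions on $\ell$, the min-max
theorem works for the above Lagrangian function, i.e., there is no duality gap. For each binary
label, we define the parametrized uncertainty sets, $\mathcal{U}_p[c]$ and $\mathcal{U}_n[c]$, by

$$\mathcal{U}_o[c] = \Big\{ \sum_{i \in M_o}\alpha_i x_i : \alpha_i \ge 0,\ \sum_{i \in M_o}\alpha_i = 1,\ \frac{1}{m}\sum_{i \in M_o}\ell^*(m\alpha_i) \le c \Big\}, \quad o \in \{p, n\}. \tag{10}$$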

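Then, the optimization problem in (9) is represented by

$$\inf\ c_p + c_n + \lambda\|z_p - z_n\| \quad \text{subject to } z_p \in \mathcal{U}_p[c_p],\ z_n \in \mathcal{U}_n[c_n],\ c_p, c_n \in \mathbb{R}. \tag{11}$$

Let $\hat{z}_p$ and $\hat{z}_n$ be the optimal solutions of $z_p$ and $z_n$ in (11). Let $\hat{w}$ be an optimal solution of
$w$ in (8). The saddle point of the above min-max problem (9) provides the relation between
$\hat{z}_p$, $\hat{z}_n$ and $\hat{w}$. Some calculation yields that, when $\hat{z}_p = \hat{z}_n$ holds, any vector $w$ such that
$\|w\|^2 \le \lambda^2$ satisfies the KKT condition of (9). On the other hand, when $\hat{z}_p \neq \hat{z}_n$ holds, $\hat{w}$ is
given by $\hat{w} = \lambda(\hat{z}_p - \hat{z}_n)/\|\hat{z}_p - \hat{z}_n\|$. Hence, an optimal solution of the normal vector in the
linear decision function is given as

$$\hat{w} = \begin{cases} \lambda\,\dfrac{\hat{z}_p - \hat{z}_n}{\|\hat{z}_p - \hat{z}_n\|}, & \hat{z}_p \neq \hat{z}_n, \\[2mm] 0, & \hat{z}_p = \hat{z}_n. \end{cases} \tag{12}$$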

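We show a sufficient condition under which the equality $\hat{z}_p = \hat{z}_n$ holds. Suppose that $\mathcal{U}_p[c_p] \cap \mathcal{U}_n[c_n]$
is nonempty for all $c_p$ and $c_n$, whenever $\mathcal{U}_p[c_p]$ and $\mathcal{U}_n[c_n]$ are both nonempty. Then, clearly
$\hat{z}_p = \hat{z}_n \in \mathcal{U}_p[c_p] \cap \mathcal{U}_n[c_n]$ is the optimal choice for the objective function in (11). In $\nu$-SVM with
a small $\nu > 0$, the reduced convex-hulls satisfy $\mathcal{U}_p \cap \mathcal{U}_n \neq \emptyset$, and hence, $\hat{z}_p = \hat{z}_n$ and $\hat{w} = 0$ hold.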

The bias term $b$ in the linear decision function is not directly obtained from the optimal
solution of (11) without knowing the explicit form of the loss function $\ell$. A simple way of
estimating the bias term is to choose $b = -(\hat{w}^\top\hat{z}_p + \hat{w}^\top\hat{z}_n)/2$, which provides the decision
boundary bisecting the line segment connecting $\hat{z}_p$ and $\hat{z}_n$. In the learning algorithm proposed
in Section 5, the bias term is estimated by minimizing the error rate

$$\min_{b \in \mathbb{R}}\ \frac{1}{m}\sum_{i=1}^{m} [\![ y_i(\hat{w}^\top x_i + b) \le 0 ]\!]. \tag{13}$$

Since the estimated normal vector $\hat{w}$ is substituted in the above objective function, the
optimization is tractable.

Based on the argument above, we propose the learning algorithm using uncertainty sets in
Figure 2. It is straightforward to apply the kernel method to the algorithm. In order to study
statistical properties of the learning algorithm based on uncertainty sets, we need a more elaborate
description of the algorithm. Details are presented in Section 5.

Learning with uncertainty set:

Step 1. Given training samples, we construct parametrized uncertainty sets
$\mathcal{U}_p[c]$ and $\mathcal{U}_n[c]$ in some way.

Step 2. Solve (11) and obtain the normal vector by (12).

Step 3. The bias term of the decision function is estimated by (13).

Figure 2: Learning algorithm based on uncertainty set.
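
The bias estimation in Step 3 is a one-dimensional problem once the normal vector is fixed. The sketch below is one possible way to carry it out, not the procedure specified in the paper: the empirical error rate is piecewise constant in $b$, so it suffices to compare one candidate per interval between consecutive projected scores. The function name and the tie-breaking rule (the first minimizer in increasing order of the threshold) are assumptions.

```python
def dot(u, v):
    """Euclidean inner product of two vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))

def estimate_bias(w, xs, ys):
    """Return (b, error_rate) minimizing the empirical rate of y_i * (w^T x_i + b) <= 0 over b."""
    scores = [dot(w, x) for x in xs]
    levels = sorted(set(scores))
    # Candidate thresholds t = -b: below all scores, between neighbours, above all scores.
    candidates = [levels[0] - 1.0]
    candidates += [(lo + hi) / 2.0 for lo, hi in zip(levels, levels[1:])]
    candidates.append(levels[-1] + 1.0)

    def error_rate(t):
        errors = sum(1 for s, y in zip(scores, ys) if y * (s - t) <= 0)
        return errors / len(ys)

    best_t = min(candidates, key=error_rate)
    return -best_t, error_rate(best_t)
```

For instance, with `w = (1, 0)` and inputs whose first coordinates are `-2, -1, 1, 2` with labels `-1, -1, +1, +1`, the threshold between `-1` and `1` separates the samples, so the returned error rate is zero.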

We show some examples of uncertainty sets (10) associated with popular loss functions. In
the following examples, the index sets, $M_p$ and $M_n$, are defined by (6) for the training samples
$(x_1,y_1),\ldots,(x_m,y_m)$, and let $m_p$ and $m_n$ be $m_p = |M_p|$ and $m_n = |M_n|$, respectively.

Example 1 ($\nu$-SVM). As explained above, the problem (8) is reduced to $\nu$-SVM by defining
$\ell(z) = \max\{2z/\nu, 0\}$. The conjugate function of $\ell$ is given as

$$\ell^*(\alpha) = \begin{cases} 0, & \alpha \in [0, 2/\nu], \\ \infty, & \alpha \notin [0, 2/\nu], \end{cases}$$

and the associated uncertainty set is defined by

$$\mathcal{U}_o[c] = \begin{cases} \Big\{ \sum_{i \in M_o}\alpha_i x_i : \sum_{i \in M_o}\alpha_i = 1,\ 0 \le \alpha_i \le \dfrac{2}{m\nu},\ i \in M_o \Big\}, & c \ge 0, \\[2mm] \emptyset, & c < 0, \end{cases} \quad o \in \{p, n\}.$$

For $c \ge 0$, the uncertainty set consists of the reduced convex-hull of training samples, and it does
not depend on the parameter $c$. In addition, a negative $c$ is infeasible. Hence, in the problem
(11), optimal solutions of $c_p$ and $c_n$ are given as $c_p = c_n = 0$, and the problem is reduced to the
simple minimum distance problem.
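Example 2 (Truncated quadratic loss). Now consider $\ell(z) = (\max\{1 + z, 0\})^2$. The conjugate
function is

$$\ell^*(\alpha) = \begin{cases} -\alpha + \dfrac{\alpha^2}{4}, & \alpha \ge 0, \\[2mm] \infty, & \alpha < 0. \end{cases}$$

For $o \in \{p, n\}$, we define $\bar{x}_o$ and $\widehat{\Sigma}_o$ as the empirical mean and the empirical covariance matrix
of the samples $\{x_i : i \in M_o\}$, i.e.,

$$\bar{x}_o = \frac{1}{m_o}\sum_{i \in M_o} x_i, \quad \widehat{\Sigma}_o = \frac{1}{m_o}\sum_{i \in M_o}(x_i - \bar{x}_o)(x_i - \bar{x}_o)^\top.$$

Suppose that $\widehat{\Sigma}_o$ is invertible. Then, the uncertainty set corresponding to the truncated quadratic
loss is given as

$$\begin{aligned}
\mathcal{U}_o[c] &= \Big\{ \sum_{i \in M_o}\alpha_i x_i : \sum_{i \in M_o}\alpha_i = 1,\ \alpha_i \ge 0,\ i \in M_o,\ \sum_{i \in M_o}\alpha_i^2 \le \frac{4(c+1)}{m} \Big\} \\
&= \Big\{ z \in \mathrm{conv}\{x_i : i \in M_o\} : (z - \bar{x}_o)^\top\widehat{\Sigma}_o^{-1}(z - \bar{x}_o) \le \frac{4(c+1)m_o}{m} \Big\}
\end{aligned}$$

for $o \in \{p, n\}$. To prove the second equality, let us define the matrix $X = (x_1,\ldots,x_{m_o}) \in \mathbb{R}^{d \times m_o}$. For $\alpha_o =
(\alpha_i)_{i \in M_o}$ satisfying the constraints, the equality $z = \sum_{i \in M_o}\alpha_i x_i = (X - \bar{x}_o 1^\top)\alpha_o + \bar{x}_o$ holds,
where $1 = (1,\ldots,1)^\top \in \mathbb{R}^{m_o}$. Then, the singular value decomposition of the matrix $X - \bar{x}_o 1^\top$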
and the constraint $\|\alpha_o\|^2 \le 4(c + 1)/m$ yield the second equality. A similar uncertainty set is
used in minimax probability machine (MPM) [15] and maximum margin MPM [18], though the
constraint, $z \in \mathrm{conv}\{x_i : i \in M_o\}$, is not imposed in these learning methods.
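
The first expression of the uncertainty set in Example 2 follows by substituting the conjugate of the truncated quadratic loss into the definition (10). The short derivation below is a sketch that only uses $\sum_{i \in M_o}\alpha_i = 1$ and $\ell^*(\alpha) = -\alpha + \alpha^2/4$ for $\alpha \ge 0$.

```latex
\frac{1}{m}\sum_{i \in M_o}\ell^*(m\alpha_i)
  = \frac{1}{m}\sum_{i \in M_o}\Big( -m\alpha_i + \frac{m^2\alpha_i^2}{4} \Big)
  = -1 + \frac{m}{4}\sum_{i \in M_o}\alpha_i^2 ,
\qquad\text{so}\qquad
\frac{1}{m}\sum_{i \in M_o}\ell^*(m\alpha_i) \le c
  \iff \sum_{i \in M_o}\alpha_i^2 \le \frac{4(c+1)}{m}.
```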



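Example 3 (exponential loss). The loss function $\ell(z) = e^z$ is used in Adaboost llSi . \l3i l. The
conjugate function is equal to

$$\ell^*(\alpha) = \begin{cases} -\alpha + \alpha\log\alpha, & \alpha \ge 0, \\ \infty, & \alpha < 0. \end{cases}$$

Hence, the corresponding uncertainty set is defined as

$$\mathcal{U}_o[c] = \Big\{ \sum_{i \in M_o}\alpha_i x_i : \sum_{i \in M_o}\alpha_i = 1,\ \alpha_i \ge 0,\ i \in M_o,\ \sum_{i \in M_o}\alpha_i\log\frac{\alpha_i}{1/m_o} \le c + 1 + \log\frac{m_o}{m} \Big\}$$

for $o \in \{p, n\}$. In the uncertainty set, the Kullback-Leibler divergence from the weight $\alpha_i$, $i \in M_o$,
to the uniform weight is bounded above.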
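
The closed-form conjugates used in Examples 2 and 3 can be compared with a brute-force evaluation of $\sup_z\{\alpha z - \ell(z)\}$. The following sketch is only a numerical sanity check under stated assumptions: the supremum is replaced by a maximum over a finite grid on $[-20, 20]$, the function names are hypothetical, and $0\log 0$ is taken as $0$. The last two helpers illustrate the Kullback-Leibler form of the constraint in Example 3.

```python
import math

def conjugate_on_grid(loss, alpha, lo=-20.0, hi=20.0, steps=40001):
    """Approximate sup_z {alpha * z - loss(z)} by a maximum over a uniform grid on [lo, hi]."""
    h = (hi - lo) / (steps - 1)
    return max(alpha * z - loss(z) for z in (lo + k * h for k in range(steps)))

def truncated_quadratic(z):
    """Loss of Example 2."""
    return max(1.0 + z, 0.0) ** 2

def exponential(z):
    """Loss of Example 3."""
    return math.exp(z)

def conj_truncated_quadratic(alpha):
    """Closed-form conjugate stated in Example 2."""
    return -alpha + alpha ** 2 / 4.0 if alpha >= 0 else math.inf

def conj_exponential(alpha):
    """Closed-form conjugate stated in Example 3, with 0 * log(0) taken as 0."""
    if alpha < 0:
        return math.inf
    return -alpha + (alpha * math.log(alpha) if alpha > 0 else 0.0)

def kl_to_uniform(weights):
    """Kullback-Leibler divergence from the weights to the uniform weights on the same index set."""
    n = len(weights)
    return sum(a * math.log(a * n) for a in weights if a > 0)

def exp_loss_level(weights, m):
    """Left-hand side (1/m) * sum_i conj_exponential(m * alpha_i) of the constraint in (10)."""
    return sum(conj_exponential(m * a) for a in weights) / m
```

For weights on $m_o$ indices out of $m$ samples, `exp_loss_level(weights, m)` equals `kl_to_uniform(weights) - 1 - math.log(len(weights) / m)`, which is the rearrangement behind the bound $c + 1 + \log(m_o/m)$.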

In this section, we derived parametrized uncertainty sets associated with convex loss functions.
Conversely, if the uncertainty set is represented in the form of (10), there exists a
corresponding loss function. When we consider statistical properties of the classifier estimated based
on the uncertainty set, we can study the equivalent estimator derived from the corresponding
loss function. We have many theoretical tools to analyze such estimators. However, if the
uncertainty set does not have the expression (10), the corresponding loss function may not exist.
In this case, we cannot apply the standard theoretical tools to understand statistical properties
of learning algorithms based on such uncertainty sets. One way to remedy this drawback is to
revise the uncertainty set so that it possesses a corresponding loss function. The next section is
devoted to studying a way of revising the uncertainty set.
4 Revision of Uncertainty Sets

Given a parametrized uncertainty set, there does not generally exist a loss function which
corresponds to the uncertainty set. In this section, we present a way of revising the uncertainty
set such that there exists a corresponding loss function.

We consider two kinds of representations for parametrized uncertainty sets: one is the vertex
representation, and the other is the level-set representation. Let $M_p$ and $M_n$ be the index sets defined in
(6), and we define $m_p = |M_p|$ and $m_n = |M_n|$. For $o \in \{p, n\}$, let $L_o$ be a closed, convex, proper
function on $\mathbb{R}^{m_o}$, and $L_o^*$ be the conjugate function of $L_o$. The argument of $L_o^*$ is represented
by $\alpha_o = (\alpha_i)_{i \in M_o}$. The vertex representation of the uncertainty set is defined as

$$\mathcal{U}_o[c] = \Big\{ \sum_{i \in M_o}\alpha_i x_i : L_o^*(\alpha_o) \le c \Big\}, \quad o \in \{p, n\}. \tag{14}$$

In Example 2, the function $L_o^*(\alpha_o) = \frac{m}{4}\sum_{i \in M_o}\alpha_i^2 - 1$ is employed. On the other hand, let us
define $h_o : \mathbb{R}^d \to \mathbb{R}$ as a closed, convex, proper function, and $h_o^*$ be the conjugate of $h_o$. The
level-set representation of the uncertainty set is defined by

$$\mathcal{U}_o[c] = \Big\{ \sum_{i \in M_o}\alpha_i x_i : h_o^*\Big(\sum_{i \in M_o}\alpha_i x_i\Big) \le c \Big\}, \quad o \in \{p, n\}. \tag{15}$$

The function $h_o^*$ may depend on the population distribution. We suppose that $h_o^*$ does not depend
on the sample points, $x_i$, $i \in M_o$. In Example 2, the second expression of the uncertainty set
involves the convex function $h_o^*(z) = (z - \bar{x}_o)^\top\widehat{\Sigma}_o^{-1}(z - \bar{x}_o)$. This function does not satisfy
the assumption, since $h_o^*$ depends on training samples via $\bar{x}_o$ and $\widehat{\Sigma}_o$. Instead, the function
$h_o^*(z) = (z - \mu_o)^\top\Sigma_o^{-1}(z - \mu_o)$ with the population mean $\mu_o$ and the population covariance
matrix $\Sigma_o$ meets the condition. When $\mu_o$ and $\Sigma_o$ are replaced with the estimated parameters
based on prior knowledge or a set of samples independent of the training samples, $\{x_i : i \in M_o\}$,
the function $h_o^*$ with the estimated parameters still satisfies the condition we imposed above.



4.1 From uncertainty sets to loss functions

In popular learning algorithms using uncertainty sets such as hard-margin SVM, $\nu$-SVM and
maximum margin MPM, the decision function is estimated by solving the minimum distance
problem (4) with $\mathcal{U}_p = \mathcal{U}_p[c_p]$ and $\mathcal{U}_n = \mathcal{U}_n[c_n]$, where $c_p$ and $c_n$ are prespecified constants. In
order to investigate the statistical properties of the learning algorithm using uncertainty sets,
we consider the primal expression of a variant of the minimum distance problem (4).

In Section 3.2, we derived the problem (11) as the dual form of (8). Here, we consider the
following optimization problem to obtain the loss function corresponding to given uncertainty
sets having the vertex representation (14):

$$\begin{aligned}
&\min\ c_p + c_n + \lambda\|z_p - z_n\| \\
&\text{subject to } c_p, c_n \in \mathbb{R}, \\
&\phantom{\text{subject to }} z_p \in \mathcal{U}_p[c_p] \cap \mathrm{conv}\{x_i : i \in M_p\}, \\
&\phantom{\text{subject to }} z_n \in \mathcal{U}_n[c_n] \cap \mathrm{conv}\{x_i : i \in M_n\}.
\end{aligned} \tag{16}$$

In the above problem, the constraints $z_o \in \mathrm{conv}\{x_i : i \in M_o\}$, $o \in \{p, n\}$, are added, since
the corresponding uncertainty set (10) has the same constraint. We derive the primal problem
corresponding to (16) via the min-max theorem. A brief calculation yields that (16) is equivalent
to

$$\begin{aligned}
&\min_{\alpha}\ L_p^*(\alpha_p) + L_n^*(\alpha_n) + \lambda\Big\|\sum_{i=1}^{m}\alpha_i y_i x_i\Big\| \\
&\text{subject to } \sum_{i \in M_p}\alpha_i = 1,\ \sum_{j \in M_n}\alpha_j = 1,\ \alpha_i \ge 0\ (i = 1,\ldots,m).
\end{aligned} \tag{17}$$

If there is no duality gap, the corresponding primal formulation of (17) is given as

$$\begin{aligned}
&\inf_{w,b,\rho,\xi_p,\xi_n}\ -2\rho + L_p(\xi_p) + L_n(\xi_n), \\
&\text{subject to } \rho - y_i(w^\top x_i + b) \le \xi_i,\ i = 1,\ldots,m, \quad \|w\|^2 \le \lambda^2,
\end{aligned} \tag{18}$$

where $\xi_o$ is defined as $\xi_o = (\xi_i)_{i \in M_o}$ for $o \in \{p, n\}$.

In the primal expression (18), $L_p$ and $L_n$ are regarded as the loss functions for the decision
function $w^\top x + b$ on training samples. In general, however, the loss function is not represented as
the empirical mean over training samples. Thus, we cannot apply the standard theoretical tools
to investigate statistical properties such as Bayes risk consistency for the learning algorithm
based on (16) or (18). On the other hand, if the problem (18) is described as the empirical loss
minimization, we can study statistical properties of the algorithm by applying the statistical
theory developed by [i^, 26, 0|- To link the uncertainty set approach with the empirical loss 



minimization, we consider a revision of the uncertainty set. 

4.2 Revised uncertainty sets and corresponding loss functions 



We propose a way of revising uncertainty sets such that the primal form (18) is represented
as minimization of the empirical mean of a loss function. Remember that the additivity of the
function is kept unchanged in the conjugate function, i.e., $(\ell_1(z_1) + \ell_2(z_2))^* = (\ell_1(z_1))^* + (\ell_2(z_2))^*$.

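Revision of uncertainty set defined by vertex representation: Suppose that the
uncertainty set is described by (14). For $o \in \{p, n\}$, we define the $m_o$-dimensional vectors
$1_o = (1,\ldots,1)$ and $0_o = (0,\ldots,0)$. For the convex function $L_o^* : \mathbb{R}^{m_o} \to \mathbb{R}$, we define
$\bar{\ell}^* : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ by

$$\bar{\ell}^*(\alpha) = \begin{cases} L_p^*\big(\tfrac{\alpha}{m}1_p\big) + L_n^*\big(\tfrac{\alpha}{m}1_n\big) - L_p^*(0_p) - L_n^*(0_n), & \alpha \ge 0, \\[1mm] \infty, & \alpha < 0. \end{cases} \tag{19}$$

The revised uncertainty set $\bar{\mathcal{U}}_o[c]$, $o \in \{p, n\}$, is defined as

$$\bar{\mathcal{U}}_o[c] = \Big\{ \sum_{i \in M_o}\alpha_i x_i : \sum_{i \in M_o}\alpha_i = 1,\ \alpha_i \ge 0,\ i \in M_o,\ \frac{1}{m}\sum_{i \in M_o}\bar{\ell}^*(\alpha_i m) \le c \Big\}.$$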

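Revision of uncertainty set defined by level-set representation: Suppose that the
uncertainty set is described by (15) and that the mean of the input vector $x$ conditioned on
the positive (resp. negative) label is given as $\mu_p$ (resp. $\mu_n$). The null vector is denoted as
$0$. We define the function $\bar{\ell}^* : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ by

$$\bar{\ell}^*(\alpha) = \begin{cases} h_p^*\big(\alpha\tfrac{m_p}{m}\mu_p\big) + h_n^*\big(\alpha\tfrac{m_n}{m}\mu_n\big) - h_p^*(0) - h_n^*(0), & \alpha \ge 0, \\[1mm] \infty, & \alpha < 0. \end{cases} \tag{20}$$

The revised uncertainty set $\bar{\mathcal{U}}_o[c]$, $o \in \{p, n\}$, is defined as

$$\bar{\mathcal{U}}_o[c] = \Big\{ \sum_{i \in M_o}\alpha_i x_i : \sum_{i \in M_o}\alpha_i = 1,\ \alpha_i \ge 0,\ i \in M_o,\ \frac{1}{m}\sum_{i \in M_o}\bar{\ell}^*(\alpha_i m) \le c \Big\}.$$

We apply a parallel shift of the training samples so that $\mu_p \neq 0$ or $\mu_n \neq 0$ holds.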

We explain the reason why the revised uncertainty set is defined as above. In the revision
(19), the uncertainty set is kept unchanged when the function $L_p^* + L_n^*$ is described in the
additive form. The precise description is presented in the following theorem.

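Theorem 1. Let $L_o^* : \mathbb{R}^{m_o} \to \mathbb{R}$, $o \in \{p, n\}$, be convex functions, and $\bar{\ell}^*$ be the function defined
by (19) for given $L_p^*$ and $L_n^*$. Suppose that $\ell : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ is a closed, convex, proper function
such that $\ell^*(0) = 0$ and $\ell^*(\alpha) = \infty$ for $\alpha < 0$ hold.

1. Suppose that the equality

$$L_p^*(\alpha_p) + L_n^*(\alpha_n) = \frac{1}{m}\sum_{i=1}^{m}\ell^*(\alpha_i m)$$

holds for all non-negative $\alpha_i$, $i = 1,\ldots,m$. Then, the equality $\bar{\ell}^* = \ell^*$ holds.

2. Suppose that the equality

$$L_p^*(\alpha 1_p) + L_n^*(\alpha 1_n) - L_p^*(0_p) - L_n^*(0_n) = \frac{1}{m}\sum_{i=1}^{m}\ell^*(\alpha m) = \ell^*(\alpha m)$$

holds for all $\alpha \ge 0$. Then, the equality $\bar{\ell}^* = \ell^*$ holds.

Proof. We prove the first statement. From the definition of $\bar{\ell}^*$ and the assumption on $\ell^*$, the
equality $\bar{\ell}^*(\alpha) = \ell^*(\alpha)$ holds for $\alpha < 0$. Suppose $\alpha \ge 0$. The assumption on $L_p^*$ and $L_n^*$ leads to
$L_p^*(\frac{\alpha}{m}1_p) + L_n^*(\frac{\alpha}{m}1_n) - L_p^*(0_p) - L_n^*(0_n) = \ell^*(\alpha)$. Hence, we have $\bar{\ell}^* = \ell^*$. The second statement
of the theorem is straightforward. □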

Theorem 1 implies that the transformation of $L_p^* + L_n^*$ to $\frac{1}{m} \sum_{i=1}^m \bar{\ell}^*(\alpha_i m)$ is a projection
onto the set of functions with the additive form. In addition, the second statement of Theorem
1 shows that the projection is uniquely determined when we impose the condition that the
values on the diagonal $\{(\alpha, \ldots, \alpha) \in \mathbb{R}^m : \alpha \ge 0\}$ are unchanged.
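
The revision (19) can be checked numerically on a small case. The following is a minimal sketch, assuming the quadratic choice $L_o^*(\alpha_o) = \alpha_o^T C_o \alpha_o$ (so that $L_o^*(0_o) = 0$); the helper names `quad_form`, `revised_conjugate` and `additive_form` are hypothetical and are not taken from the paper. It illustrates that the additive form reproduces $L_p^* + L_n^*$ when both matrices are the identity, and that values on the diagonal are unchanged for other matrices.

```python
# Sketch of the revision (19) for L*_o(a) = a^T C_o a (an assumed special case).
# All function names are illustrative only.

def quad_form(C, a):
    """Return a^T C a for a square matrix C given as a list of rows."""
    n = len(a)
    return sum(a[i] * C[i][j] * a[j] for i in range(n) for j in range(n))

def identity(n):
    """Return the n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def revised_conjugate(alpha, C_p, C_n):
    """Evaluate the revised function of (19) at the scalar alpha."""
    if alpha < 0:
        return float("inf")
    m_p, m_n = len(C_p), len(C_n)
    m = m_p + m_n
    # L*_o(0_o) = 0 for a quadratic form, so the subtracted terms vanish.
    return quad_form(C_p, [alpha / m] * m_p) + quad_form(C_n, [alpha / m] * m_n)

def additive_form(alphas, C_p, C_n):
    """Return (1/m) * sum_i revised_conjugate(alpha_i * m)."""
    m = len(alphas)
    return sum(revised_conjugate(a * m, C_p, C_n) for a in alphas) / m

# Identity matrices: the additive form equals L*_p + L*_n everywhere.
alpha_p, alpha_n = [0.5, 0.3, 0.2], [0.6, 0.4]
C_p, C_n = identity(3), identity(2)
original = quad_form(C_p, alpha_p) + quad_form(C_n, alpha_n)
revised = additive_form(alpha_p + alpha_n, C_p, C_n)

# Non-identity matrices: agreement is only guaranteed on the diagonal.
C_p2 = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
C_n2 = [[1.0, 0.5], [0.5, 1.0]]
a, m = 1.7, 5
diag_original = quad_form(C_p2, [a / m] * 3) + quad_form(C_n2, [a / m] * 2)
diag_revised = additive_form([a / m] * m, C_p2, C_n2)
```

In this sketch `original` and `revised` agree up to rounding, as do `diag_original` and `diag_revised`.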

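Next, we explain the validity of the formula (20). We want to find a function $\bar{\ell}^*(\alpha)$ such
that $h_p^*(\sum_{i \in M_p} \alpha_i x_i) + h_n^*(\sum_{i \in M_n} \alpha_i x_i) - h_p^*(0) - h_n^*(0)$ is close to $\frac{1}{m} \sum_{i=1}^m \bar{\ell}^*(\alpha_i m)$ in some
sense. We substitute $\alpha_i = \alpha/m$ into $h_o^*(\sum_{i \in M_o} \alpha_i x_i)$, $o \in \{p,n\}$. In the large sample limit,
$h_o^*(\sum_{i \in M_o} \tfrac{\alpha}{m} x_i)$ is approximated by $h_o^*(\alpha \tfrac{m_o}{m} \mu_o)$. Suppose that

$$h_p^*(\alpha \tfrac{m_p}{m} \mu_p) + h_n^*(\alpha \tfrac{m_n}{m} \mu_n) - h_p^*(0) - h_n^*(0)$$

is represented as $\frac{1}{m} \sum_{i=1}^m \bar{\ell}^*(\tfrac{\alpha}{m} m) = \bar{\ell}^*(\alpha)$. Then, we obtain (20).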

For the revised uncertainty sets $\bar{\mathcal{U}}_p[c]$ and $\bar{\mathcal{U}}_n[c]$, the corresponding primal problem of

$$\min_{c_p, c_n, z_p, z_n} c_p + c_n + \lambda \|z_p - z_n\| \quad \text{subject to } z_p \in \bar{\mathcal{U}}_p[c_p],\ z_n \in \bar{\mathcal{U}}_n[c_n] \tag{21}$$

is given as

$$\inf_{w, b, \rho, \xi} -2\rho + \frac{1}{m} \sum_{i=1}^m \bar{\ell}(\xi_i) \quad \text{subject to } \rho - y_i(w^T x_i + b) \le \xi_i,\ i = 1, \ldots, m,\ \|w\|^2 \le \lambda^2.$$

The revision of the uncertainty sets leads to the empirical mean of the revised loss function $\bar{\ell}$.
When we study statistical properties of the estimator given by the optimal solution of (21), we
can apply the standard theoretical tools, since the objective in the primal expression is described
by the empirical mean of the revised loss function.

We show some examples to illustrate how the revision of the uncertainty set works. 

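Example 4. Let $L_o^*$, $o \in \{p,n\}$, be the convex function $L_o^*(\alpha_o) = \alpha_o^T C_o \alpha_o$, where $C_o$ is a positive
definite matrix. The revised function defined by (19) is given as

$$\bar{\ell}^*(\alpha) = \frac{\alpha^2}{m^2} \left( 1_p^T C_p 1_p + 1_n^T C_n 1_n \right)$$

for $\alpha \ge 0$. Then, we have

$$\frac{1}{m} \sum_{i=1}^m \bar{\ell}^*(\alpha_i m) = \frac{1_p^T C_p 1_p + 1_n^T C_n 1_n}{m} \sum_{i=1}^m \alpha_i^2.$$

When both $C_p$ and $C_n$ are the identity matrix, the equality

$$L_p^*(\alpha_p) + L_n^*(\alpha_n) = \frac{1}{m} \sum_{i=1}^m \bar{\ell}^*(\alpha_i m) = \sum_{i=1}^m \alpha_i^2$$

holds. Let $k$ be $k = 1_p^T C_p 1_p + 1_n^T C_n 1_n$. Then, the revised uncertainty set is given as

$$\bar{\mathcal{U}}_o[c] = \Big\{ \sum_{i \in M_o} \alpha_i x_i \,:\, \sum_{i \in M_o} \alpha_i = 1,\ \alpha_i \ge 0\ (i \in M_o),\ \sum_{i \in M_o} \alpha_i^2 \le \frac{cm}{k} \Big\}, \quad o \in \{p,n\}.$$

For $o \in \{p,n\}$, let $\bar{x}_o$ and $\hat{\Sigma}_o$ be the empirical mean and the empirical covariance matrix,

$$\bar{x}_o = \frac{1}{m_o} \sum_{i \in M_o} x_i, \qquad \hat{\Sigma}_o = \frac{1}{m_o} \sum_{i \in M_o} (x_i - \bar{x}_o)(x_i - \bar{x}_o)^T.$$

If $\hat{\Sigma}_o$ is invertible, we have

$$\bar{\mathcal{U}}_o[c] = \Big\{ z \in \mathrm{conv}\{x_i : i \in M_o\} \,:\, (z - \bar{x}_o)^T \hat{\Sigma}_o^{-1} (z - \bar{x}_o) \le \frac{c\, m\, m_o}{k} - 1 \Big\}.$$

In the learning algorithm based on the revised uncertainty set, the estimator is obtained by solving

$$\min_{c_p, c_n, z_p, z_n} c_p + c_n + \lambda \|z_p - z_n\| \quad \text{subject to } z_p \in \bar{\mathcal{U}}_p[c_p],\ z_n \in \bar{\mathcal{U}}_n[c_n],$$

which is equivalent to

$$\min_{c_p, c_n, z_p, z_n} c_p + c_n + \frac{m^2 \lambda}{4k} \|z_p - z_n\| \quad \text{subject to } z_p \in \bar{\mathcal{U}}_p\Big[\frac{4 c_p k}{m^2}\Big],\ z_n \in \bar{\mathcal{U}}_n\Big[\frac{4 c_n k}{m^2}\Big].$$

The corresponding primal expression is given as

$$\min_{w, b, \rho, \xi} -2\rho + \frac{1}{m} \sum_{i=1}^m \xi_i^2 \quad \text{subject to } \rho - y_i(w^T x_i + b) \le \xi_i,\ 0 \le \xi_i,\ \forall i,\ \|w\|^2 \le \Big( \frac{m^2 \lambda}{4k} \Big)^2.$$

[Figure 3 panels: original uncertainty set $\mathcal{U}_p[c]$ (left); revised uncertainty set $\bar{\mathcal{U}}_p[c]$ (right).]

Figure 3: Training samples and the uncertainty sets are depicted. Left panel: the original
uncertainty set for the positive label. Right panel: the revised uncertainty set, which consists of
the intersection of the ellipsoid and the convex hull of the input vectors with positive label.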



Example 5. We define $h_o^* : \mathcal{X} \to \mathbb{R}$ for $o \in \{p,n\}$ by

$$h_o^*(z) = (z - \mu_o)^T C_o (z - \mu_o),$$

where $\mu_o$ is the mean vector of the input vector $x$ conditioned on each label and $C_o$ is a positive
definite matrix. In practice, the mean vector is estimated by using prior knowledge which is
independent of the training samples $\{(x_i, y_i) : i = 1, \ldots, m\}$. Suppose that $\mu_o \ne 0$. Then, for
$\alpha \ge 0$, the revision of (20) leads to

$$\bar{\ell}^*(\alpha) = \Big( \big(\alpha \tfrac{m_p}{m} - 1\big)^2 - 1 \Big) \mu_p^T C_p \mu_p + \Big( \big(\alpha \tfrac{m_n}{m} - 1\big)^2 - 1 \Big) \mu_n^T C_n \mu_n = b_1 \alpha + b_2 \alpha^2,$$

where $b_1$ and $b_2\ (> 0)$ are constant numbers. Thus, we have

$$\bar{\mathcal{U}}_o[c] = \Big\{ \sum_{i \in M_o} \alpha_i x_i \,:\, \sum_{i \in M_o} \alpha_i = 1,\ \alpha_i \ge 0\ (i \in M_o),\ \sum_{i \in M_o} \alpha_i^2 \le \frac{c - b_1}{m b_2} \Big\}$$
$$= \Big\{ z \in \mathrm{conv}\{x_i : i \in M_o\} \,:\, (z - \bar{x}_o)^T \hat{\Sigma}_o^{-1} (z - \bar{x}_o) \le m_o \cdot \frac{c - b_1}{m b_2} - 1 \Big\},$$

where $\bar{x}_o$ and $\hat{\Sigma}_o$ are the estimators of the mean vector and the covariance matrix based on
training samples $\{x_i : i \in M_o\}$. The corresponding loss function is obtained in the same way
as in Example 4. Figure 3 illustrates an example of the revision of the uncertainty set. In the left
panel, the uncertainty set does not match the distribution of the training samples. The revised
uncertainty set in the right panel appears to approximate the dispersal of the training samples well.



Example 6. We suppose that for $o \in \{p,n\}$, $\mu_o$ is the mean vector and $\Sigma_o$ is the covariance
matrix of the input vector conditioned on each label. We define the uncertainty set by

$$\mathcal{U}_o[c] = \big\{ z \in \mathrm{conv}\{x_i : i \in M_o\} \,:\, (z - \mu)^T \Sigma_o^{-1} (z - \mu) \le c,\ \forall \mu \in \mathcal{A} \big\}, \quad o \in \{p,n\},$$

where $\mathcal{A}$ denotes the estimation error of the mean vector $\mu$. For a fixed radius $r > 0$, $\mathcal{A}$ is
defined as

$$\mathcal{A} = \big\{ \mu \in \mathcal{X} \,:\, (\mu - \mu_o)^T \Sigma_o^{-1} (\mu - \mu_o) \le r^2 \big\}.$$

The uncertainty set with estimation error is used by fl^J in MPM. The above uncertainty sets 
will be useful when the probability distribution in the training phase is slightly different from that in the test
phase. A brief calculation yields that $\mathcal{U}_o[c]$ is represented by the level set of the convex function

$$h_o^*(z) = \max_{\mu \in \mathcal{A}} (z - \mu)^T \Sigma_o^{-1} (z - \mu) = \Big( \sqrt{(z - \mu_o)^T \Sigma_o^{-1} (z - \mu_o)} + r \Big)^2.$$



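The revised uncertainty set $\bar{\mathcal{U}}_o[c]$ is defined by the function $\bar{\ell}^*$, which for $\alpha \ge 0$ is given as

$$\bar{\ell}^*(\alpha) = \Big( \big| \alpha \tfrac{m_p}{m} - 1 \big| \sqrt{\mu_p^T \Sigma_p^{-1} \mu_p} + r \Big)^2 + \Big( \big| \alpha \tfrac{m_n}{m} - 1 \big| \sqrt{\mu_n^T \Sigma_n^{-1} \mu_n} + r \Big)^2 - \Big( \sqrt{\mu_p^T \Sigma_p^{-1} \mu_p} + r \Big)^2 - \Big( \sqrt{\mu_n^T \Sigma_n^{-1} \mu_n} + r \Big)^2. \tag{22}$$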



We suppose that $\mu_p \ne 0$ and $\mu_n = 0$ hold. Let $d = \sqrt{\mu_p^T \Sigma_p^{-1} \mu_p}$ and $h = r/d\ (> 0)$. Then, the
corresponding loss function is given as

$$\bar{\ell}(z) = d^2\, u\Big( \frac{m z}{m_p d^2} \Big),$$



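where $u(z)$ is defined as

$$u(z) = \begin{cases} 0, & z \le -2h - 2, \\ \big( \tfrac{z}{2} + 1 + h \big)^2, & -2h - 2 \le z \le -2h, \\ z + 2h + 1, & -2h \le z \le 2h, \\ \tfrac{z^2}{4} + z(1 - h) + (1 + h)^2, & 2h \le z. \end{cases} \tag{23}$$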



Figure 4 depicts the function $u(z)$ with $h = 1$. When $r = 0$ holds, $\bar{\ell}(z)$ reduces to the
truncated quadratic function shown in Examples 4 and 5. For positive $r$, $\bar{\ell}(z)$ is linear around
$z = 0$. This implies that by introducing the confidence set of the mean vector, $\mathcal{A}$, the penalty for
misclassification is reduced from quadratic to linear around the decision boundary, though the
original uncertainty set $\mathcal{U}_o[c]$ does not correspond to minimization of an empirical loss function.
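
The piecewise form of $u(z)$ can be sanity-checked against a brute-force conjugate. The sketch below assumes that $u$ is the conjugate of $\alpha \mapsto (|\alpha - 1| + h)^2 - (1 + h)^2$ on $\alpha \ge 0$, which is the function appearing in the revision with estimation error after rescaling; the two pieces on $z \le -2h$ are obtained here by direct calculation and should be read as an assumption wherever the printed formula is illegible. The names `u`, `u_conj` and `u_numeric` are hypothetical.

```python
def u(z, h):
    """Piecewise loss u(z) with parameter h >= 0 (closed form)."""
    if z <= -2 * h - 2:
        return 0.0
    if z <= -2 * h:
        return (z / 2 + 1 + h) ** 2
    if z <= 2 * h:
        return z + 2 * h + 1
    return z * z / 4 + z * (1 - h) + (1 + h) ** 2

def u_conj(alpha, h):
    """Function whose conjugate is u: (|alpha - 1| + h)^2 - (1 + h)^2 for alpha >= 0."""
    return (abs(alpha - 1) + h) ** 2 - (1 + h) ** 2

def u_numeric(z, h, alpha_max=20.0, steps=20000):
    """Brute-force sup over a grid of alpha in [0, alpha_max] of z*alpha - u_conj(alpha)."""
    best = float("-inf")
    for k in range(steps + 1):
        a = alpha_max * k / steps
        best = max(best, z * a - u_conj(a, h))
    return best

# Compare the closed form and the grid search at a few points for h = 1.
checks = [(z, u(z, 1.0), u_numeric(z, 1.0)) for z in (-6.0, -3.5, -3.0, -1.0, 0.0, 1.5, 2.0, 5.0)]
```

With $h = 0$ the closed form collapses to the truncated quadratic $(\max\{z/2 + 1, 0\})^2$, and for $h > 0$ the middle piece is linear around $z = 0$.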



5 Kernel-based Learning Algorithm 

We present a kernel variant of the learning algorithm using uncertainty sets. Suppose that
training samples $(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \{+1, -1\}$ are observed, where $\mathcal{X}$ is not necessarily
a linear space. We define the kernel function $k : \mathcal{X}^2 \to \mathbb{R}$, and let $\mathcal{H}$ be the reproducing kernel
Hilbert space (RKHS) endowed with the kernel function k. See [j^ ] for the details of the kernel 
estimators in machine learning. We consider the estimator of the decision function having the
form of $f(x) + b$, where $f \in \mathcal{H}$ and $b \in \mathbb{R}$. In our algorithm, the function part $f(x)$ and the bias
term $b$ are separately estimated.

Figure 4: The loss function $u(z)$ in Example 6 is depicted, which corresponds to the revised
uncertainty set with the estimation error.

Figure 5 shows a kernel variant of the learning algorithm based on uncertainty sets. The
algorithm is regarded as an extension of $\nu$-SVM and maximum margin MPM, since the uncertainty
set is extended from the reduced convex hull or ellipsoidal uncertainty set to a general uncertainty set.
The proposed algorithm is also a revision of the existing method based on the simple minimum
distance problem. We illustrate the proposed algorithm below.

In the learning algorithm, the training samples are divided into two disjoint subsets, $T_1$ and $T_2$,
which are described as

$$T_k = \{(x_i^{(k)}, y_i^{(k)}) : i = 1, \ldots, m_k\}, \quad k = 1, 2.$$

The reason that we decompose the training samples is to simplify the analysis of statistical
properties of the learning algorithm. In the kernel-based algorithm, the uncertainty sets, $\mathcal{U}_p[c]$
and $\mathcal{U}_n[c]$, are convex subsets of $\mathcal{H}$. Let $M_p$ and $M_n$ be the index sets of $T_1$ defined by

$$M_p = \{i : y_i^{(1)} = +1,\ i = 1, \ldots, m_1\}, \quad M_n = \{i : y_i^{(1)} = -1,\ i = 1, \ldots, m_1\}.$$

For $o \in \{p,n\}$, the uncertainty set $\mathcal{U}_o[c] \subset \mathcal{H}$ is defined as a convex subset of the convex hull of
$\{k(\cdot, x_i^{(1)}) : i \in M_o\}$. Moreover, we assume that the monotonicity $\mathcal{U}_o[c] \subset \mathcal{U}_o[c']$ holds for $c \le c'$.
If necessary, we revise the uncertainty set as shown in Section 4 in order to link the uncertainty
set with a loss function.

When the uncertainty sets involve some parameters to be estimated, prior knowledge or
additional samples independent of the training samples $T_1 \cup T_2$ are used for their estimation. For
example, the uncertainty set defined by the level set of $h_o^*(z) = (z - \mu_o)^T \Sigma_o^{-1} (z - \mu_o)$, $o \in \{p,n\}$,
involves the mean vector $\mu_o$ and the covariance matrix $\Sigma_o$. In our algorithm, we need to prepare
additional samples to estimate $\mu_o$ and $\Sigma_o$.

Inputs. Decompose the training samples into two disjoint subsets,

$$T_1 = \{(x_i^{(1)}, y_i^{(1)}) : i = 1, \ldots, m_1\}, \quad T_2 = \{(x_i^{(2)}, y_i^{(2)}) : i = 1, \ldots, m_2\}.$$

For the set of training samples $T_1$, let $M_p$ and $M_n$ be the index sets
defined by $M_p = \{i : y_i^{(1)} = +1,\ i = 1, \ldots, m_1\}$ and $M_n = \{i : y_i^{(1)} = -1,\ i = 1, \ldots, m_1\}$, respectively.

Initialization. We define the RKHS $\mathcal{H}$ with the kernel function $k(x, x')$.
Prepare the parametrized uncertainty sets $\mathcal{U}_p[c]$ and $\mathcal{U}_n[c]$ in $\mathcal{H}$ such
that

$$\mathcal{U}_p[c] \subset \mathrm{conv}\{k(\cdot, x_i^{(1)}) : i \in M_p\}, \quad \mathcal{U}_n[c] \subset \mathrm{conv}\{k(\cdot, x_i^{(1)}) : i \in M_n\}.$$

When the uncertainty sets involve some parameters to be estimated,
prior knowledge or additional samples independent of the training
samples $T_1 \cup T_2$ are used for their estimation. If necessary, we apply
the revision of the uncertainty sets presented in Section 4 in order to
link the uncertainty set with a loss function. Set the regularization
parameter $\lambda > 0$.

Step 1. Solve the optimization problem,

$$\inf_{c_p, c_n, f_p, f_n} c_p + c_n + \lambda \|f_p - f_n\|_{\mathcal{H}} \quad \text{subject to } f_p \in \mathcal{U}_p[c_p],\ f_n \in \mathcal{U}_n[c_n],\ c_p, c_n \in \mathbb{R}.$$

Optimal solutions of $f_p$ and $f_n$ are denoted as $\hat{f}_p$ and $\hat{f}_n$. Define $\hat{f}$ by

$$\hat{f} = \begin{cases} \dfrac{\hat{f}_p - \hat{f}_n}{\|\hat{f}_p - \hat{f}_n\|_{\mathcal{H}}}, & \hat{f}_p \ne \hat{f}_n, \\ 0, & \hat{f}_p = \hat{f}_n. \end{cases}$$

Step 2. Solve the one-dimensional optimization problem defined from the
estimator $\hat{f}$ and the data set $T_2$,

$$\min_{b \in \mathbb{R}} \widehat{\mathcal{E}}_{T_2}(\hat{f} + b).$$

The optimal solution is denoted as $\hat{b}$.

Output. The estimator of the decision function is given by $\hat{f}(x) + \hat{b}$.

Figure 5: Kernel-based learning algorithm using uncertainty sets.

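The subset $T_1$ is used for the estimation of the function part $f \in \mathcal{H}$ in the decision function.
First, we solve the problem

$$\inf_{c_p, c_n, f_p, f_n} c_p + c_n + \lambda \|f_p - f_n\|_{\mathcal{H}} \quad \text{subject to } f_p \in \mathcal{U}_p[c_p],\ f_n \in \mathcal{U}_n[c_n],\ c_p, c_n \in \mathbb{R}. \tag{24}$$

Let $\hat{f}_p$ and $\hat{f}_n$ be optimal solutions of $f_p$ and $f_n$ in (24). Then, in the same way as (12), the
function part of the decision function is estimated by

$$\hat{f} = \begin{cases} \dfrac{\hat{f}_p - \hat{f}_n}{\|\hat{f}_p - \hat{f}_n\|_{\mathcal{H}}}, & \hat{f}_p \ne \hat{f}_n, \\ 0, & \hat{f}_p = \hat{f}_n. \end{cases} \tag{25}$$

For the estimation of the bias term $b$, the data set $T_2$ is used. The bias estimator $\hat{b}$ is an optimal
solution of

$$\min_{b \in \mathbb{R}} \widehat{\mathcal{E}}_{T_2}(\hat{f} + b). \tag{26}$$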

Our purpose is to obtain a decision function with a low prediction error. Hence, the error
rate (26) is an appropriate criterion for the estimation of the bias term. Though the
minimization of the training error rate is generally a hard task, the one-dimensional optimization is easily
conducted. Then, the estimator of the decision function is given by $\hat{f}(x) + \hat{b}$. By separating the
training data used in Step 1 and Step 2, we can simplify the statistical analysis of the estimator.
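
The one-dimensional search for the bias term can be sketched as follows. This is an illustration only, not the authors' implementation: it assumes that the values of the estimated function part on the second subset are already available as a list `scores`, that a point is counted as misclassified when $y(f(x) + b) \le 0$ (ties count as errors), and the names `error_rate` and `best_bias` are hypothetical. Since the training error is piecewise constant in $b$ and can change only at $b = -f(x_i)$, it suffices to test one candidate in each interval between consecutive breakpoints.

```python
def error_rate(scores, labels, b):
    """Empirical 0-1 error of sign(score + b); a zero margin counts as an error."""
    wrong = sum(1 for s, y in zip(scores, labels) if y * (s + b) <= 0)
    return wrong / len(scores)

def best_bias(scores, labels):
    """Return a bias b minimising the empirical 0-1 error over b in R."""
    # Breakpoints of the piecewise-constant error, in increasing order.
    t = sorted(set(-s for s in scores))
    # One candidate below all breakpoints, one between each pair, one above all.
    candidates = [t[0] - 1.0]
    candidates += [(lo + hi) / 2 for lo, hi in zip(t, t[1:])]
    candidates += [t[-1] + 1.0]
    return min(candidates, key=lambda b: error_rate(scores, labels, b))

# Toy usage: the scores are shifted upwards, so a negative bias is needed.
scores = [2.0, 1.5, 1.0, 0.5]
labels = [1, 1, -1, -1]
b_hat = best_bias(scores, labels)
```

The search evaluates at most $m_2 + 1$ candidates, so even this quadratic-time version is cheap compared with Step 1.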

6 Statistical Properties of Kernel-based Learning Algorithm 

In this section, we study statistical properties of the learning algorithm presented in Figure 5.
In particular, we prove that the expected 0-1 loss of the estimator, $\mathcal{E}(\hat{f} + \hat{b})$, converges to the Bayes
risk £* defined by i^. 

6.1 Definitions and assumptions 

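We derive the dual representation of the learning algorithm in Figure 5. For a convex function
$\ell : \mathbb{R} \to \mathbb{R}$, let $\ell^*$ be the conjugate function of $\ell$. For $o \in \{p,n\}$, suppose that the uncertainty
sets are described in the form of

$$\mathcal{U}_o[c] = \Big\{ \sum_{i \in M_o} \alpha_i k(\cdot, x_i^{(1)}) \in \mathcal{H} \,:\, \sum_{i \in M_o} \alpha_i = 1,\ \alpha_i \ge 0\ (i \in M_o),\ \frac{1}{m_1} \sum_{i \in M_o} \ell^*(m_1 \alpha_i) \le c \Big\}. \tag{27}$$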

In the same way as the derivation in Section \'6.2\ we find that the problem (j24p is the dual 
representation of

$$\inf_{f, b, \rho} -2\rho + \frac{1}{m_1} \sum_{i=1}^{m_1} \ell\big( \rho - y_i^{(1)} (f(x_i^{(1)}) + b) \big) \quad \text{subject to } f \in \mathcal{H},\ b \in \mathbb{R},\ \rho \in \mathbb{R},\ \|f\|_{\mathcal{H}}^2 \le \lambda^2. \tag{28}$$

Later on, we show a rigorous proof of the duality between (28) and (24) with the uncertainty set
(27). In order to investigate statistical properties of the learning algorithm using uncertainty
sets, we consider the primal problems (28) and (26) instead of the dual problems (24) and (26).

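We define some notation. For a measurable function $f : \mathcal{X} \to \mathbb{R}$ and a real number $\rho \in \mathbb{R}$,
we define the expected loss $\mathcal{R}(f, \rho)$ and the regularized expected loss $\mathcal{R}_\lambda(f, \rho)$ by

$$\mathcal{R}(f, \rho) = -2\rho + \mathbb{E}[\ell(\rho - y f(x))],$$
$$\mathcal{R}_\lambda(f, \rho) = -2\rho + \mathbb{E}[\ell(\rho - y f(x))] + \delta(\|f\|_{\mathcal{H}}^2 \le \lambda^2),$$

where $\lambda$ is a positive number and $\delta(A)$ equals $0$ when $A$ is true and $\infty$ otherwise. Let $\mathcal{R}^*$ be the
infimum of $\mathcal{R}(f, \rho)$,

$$\mathcal{R}^* = \inf\{\mathcal{R}(f, \rho) : f \in L_0,\ \rho \in \mathbb{R}\}.$$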

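For the set of training samples, $T = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, the empirical loss $\widehat{\mathcal{R}}_T(f, \rho)$ and the
regularized empirical loss $\widehat{\mathcal{R}}_{T,\lambda}(f, \rho)$ are defined by

$$\widehat{\mathcal{R}}_T(f, \rho) = -2\rho + \frac{1}{m} \sum_{i=1}^m \ell(\rho - y_i f(x_i)),$$
$$\widehat{\mathcal{R}}_{T,\lambda}(f, \rho) = -2\rho + \frac{1}{m} \sum_{i=1}^m \ell(\rho - y_i f(x_i)) + \delta(\|f\|_{\mathcal{H}}^2 \le \lambda^2).$$

The subscript $T$ is dropped if it is clear from the context.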
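
As an illustration of these definitions, the empirical loss and its regularized version can be evaluated as below. This is a minimal sketch under stated assumptions: the function values $f(x_i)$ and the RKHS norm of $f$ are assumed to be precomputed and passed in as `f_values` and `f_norm`, the truncated quadratic loss is used only as an example, and all function names are hypothetical.

```python
import math

def truncated_quadratic(z):
    """Example loss: (max{z + 1, 0})^2."""
    return max(z + 1.0, 0.0) ** 2

def empirical_loss(f_values, labels, rho, loss):
    """-2*rho + (1/m) * sum_i loss(rho - y_i * f(x_i))."""
    m = len(labels)
    total = sum(loss(rho - y * fx) for fx, y in zip(f_values, labels))
    return -2.0 * rho + total / m

def regularized_empirical_loss(f_values, labels, rho, loss, f_norm, lam):
    """Empirical loss plus the indicator term: infinite when ||f||^2 > lam^2."""
    if f_norm ** 2 > lam ** 2:
        return math.inf
    return empirical_loss(f_values, labels, rho, loss)

# Toy usage with two correctly classified points.
value = empirical_loss([1.0, -1.0], [1, -1], 0.5, truncated_quadratic)
```

The indicator term simply encodes the norm constraint of the primal problem as part of the objective.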

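For the observed training samples $T_1 = \{(x_i^{(1)}, y_i^{(1)}) : i = 1, \ldots, m_1\}$, clearly the problem
(28) is identical to the minimization of $\widehat{\mathcal{R}}_{T_1,\lambda}(f + b, \rho)$. We define $\hat{f}$, $\tilde{b}$ and $\hat{\rho}$ as an optimal solution
of

$$\min_{f, b, \rho} \widehat{\mathcal{R}}_{T_1, \lambda_{m_1}}(f + b, \rho), \quad f \in \mathcal{H},\ b \in \mathbb{R},\ \rho \in \mathbb{R}, \tag{29}$$

where the regularization parameter $\lambda_{m_1}$ may depend on the sample size. For the index sets $M_p$
and $M_n$ in Figure 5, we define $m_p = |M_p|$ and $m_n = |M_n|$.
We introduce the following assumptions.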

Assumption 1 (universal kernel). The input space $\mathcal{X}$ is a compact metric space. The kernel
function $k : \mathcal{X}^2 \to \mathbb{R}$ is continuous, and satisfies

$$\sup_{x \in \mathcal{X}} \sqrt{k(x, x)} \le K < \infty,$$

where $K$ is a positive constant. In addition, $k$ is universal, i.e., the RKHS associated with $k$
is dense in the set of all continuous functions on X with respect to the supremum norm 12% 
Definition 4-52]. 

Assumption 2 (non-deterministic assumption). For the probability distribution of training
samples, there exists a positive constant $\varepsilon > 0$ such that

$$P(\{x \in \mathcal{X} : \varepsilon \le P(+1|x) \le 1 - \varepsilon\}) > 0$$

holds, where $P(y|x)$ is the conditional probability of the label $y$ for given input $x$.

Assumption 3 (basic assumptions on the loss function). The loss function $\ell : \mathbb{R} \to \mathbb{R}$ satisfies
the following conditions.

1. $\ell$ is a non-decreasing, convex function, and satisfies the non-negativity condition, i.e.,
$\ell(z) \ge 0$ for all $z \in \mathbb{R}$.

2. Let $\partial\ell(z)$ be the subdifferential of the loss function $\ell$ at $z \in \mathbb{R}$ [\2a, Chap. 23]. Then, the
equality $\lim_{z \to \infty} \partial\ell(z) = \infty$ holds, i.e., for any $M > 0$, there exists $z_0$ such that for all
$z \ge z_0$ and all $g \in \partial\ell(z)$, the inequality $g \ge M$ holds.

Note that the second condition in Assumption 3 assures that $\ell$ is not a constant function and
that $\lim_{z \to \infty} \ell(z) = \infty$ holds.

Assumption 4 (modified classification-calibrated loss).

1. $\ell(z)$ is first-order differentiable for $z \ge -\ell(0)/2$, and $\ell'(z) > 0$ holds for $z \ge -\ell(0)/2$,
where $\ell'$ is the derivative of $\ell$.

2. Let $\psi(\theta, \rho)$ be the function defined as

$$\psi(\theta, \rho) = \ell(\rho) - \inf_{z \in \mathbb{R}} \Big\{ \frac{1 + \theta}{2} \ell(\rho - z) + \frac{1 - \theta}{2} \ell(\rho + z) \Big\}, \quad 0 \le \theta \le 1,\ \rho \in \mathbb{R}.$$

There exist a function $\bar{\psi}(\theta)$ and a positive real $\varepsilon > 0$ such that the following conditions are
satisfied:

(a) $\bar{\psi}(0) = 0$ and $\bar{\psi}(\theta) > 0$ for $0 < \theta \le \varepsilon$.

(b) $\bar{\psi}(\theta)$ is a continuous and strictly increasing function on the interval $[0, \varepsilon]$.

(c) The inequality $\bar{\psi}(\theta) \le \inf_{\rho \ge -\ell(0)/2} \psi(\theta, \rho)$ holds for $0 \le \theta \le \varepsilon$.

Later on, we shall give some sufficient conditions for existence of the function $\bar{\psi}$ in Assumption 4.

We prove that there is no duality gap between (24) and (28). The proof of the following
lemma is given in Appendix A.

Lemma 1. Suppose that both Mp and M„ in Figurel^ are non-empty, i.e., nip and ?7i„ are pos- 
itive numbers. Under Assumption{l\ andl^ there exists an optimal solution for (j28p . Moreover, 
the dual problem of ()28p yields the problem (j24p with the uncertainty set ()27p . 

In the following, we prove the convergence of the error rate to the Bayes risk $\mathcal{E}^*$. The proof
consists of two parts. In Section 6.2, we prove that the expected loss for the estimated decision
function, $\mathcal{R}(\hat{f} + \tilde{b}, \hat{\rho})$, converges to the infimum of the expected loss $\mathcal{R}^*$, where $\hat{f}$, $\tilde{b}$ and $\hat{\rho}$ are
optimal solutions of (29). Here, we apply the mathematical tools developed by [26]. In Section
6.3, we prove the convergence of the error rate $\mathcal{E}(\hat{f} + \hat{b})$ to the Bayes risk $\mathcal{E}^*$, where $\hat{b}$ is an
optimal solution of (26). In the proof, the concept of the classification-calibrated loss 0] plays
an important role.

6.2 Convergence to Optimal Expected Loss

In this section, we prove that $\mathcal{R}(\hat{f} + \tilde{b}, \hat{\rho})$ converges to $\mathcal{R}^*$. The following lemmas show the relation
between the expected loss and the regularized expected loss. Proofs are shown in Appendix B.

Lemma 2. Under Assumptionl^ and Assumptionl^ we have IZ* > — oo. 
Lemma 3. Under AssumptionUl [J and\^ we have 

$$\lim_{\lambda \to \infty} \inf\{\mathcal{R}_\lambda(f, \rho) : f \in \mathcal{H},\ \rho \in \mathbb{R}\} = \mathcal{R}^*. \tag{30}$$

We derive an upper bound on the norm of the optimal solution in (29). The proof is deferred
to Appendix B.

Lemma 4. Under AssumptionlJl [H andO there are positive constants c and C and a natural 
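number $M$ such that the optimal solution of (29) satisfies

$$\|\hat{f}\|_{\mathcal{H}} \le \lambda_{m_1}, \quad |\tilde{b}| \le C \lambda_{m_1}, \quad |\hat{\rho}| \le C \lambda_{m_1}, \tag{31}$$

with probability greater than $1 - e^{-c m_1}$ for $m_1 \ge M$.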

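Let us define the covering number for a metric space.

Definition 1 (covering number). For a metric space $\mathcal{G}$, the covering number of $\mathcal{G}$ is defined as

$$\mathcal{N}(\mathcal{G}, \varepsilon) = \min\Big\{ n \in \mathbb{N} \,:\, \exists\, g_1, \ldots, g_n \in \mathcal{G} \text{ such that } \mathcal{G} \subset \bigcup_{i=1}^n B(g_i, \varepsilon) \Big\},$$

where $B(g, \varepsilon)$ denotes the closed ball with center $g$ and radius $\varepsilon$.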

According to Lemma 4, the optimal solution, $\hat{f}$, $\tilde{b}$ and $\hat{\rho}$, is included in the set

$$\mathcal{G}_{m_1} = \{(f, b, \rho) \in \mathcal{H} \times \mathbb{R}^2 : \|f\|_{\mathcal{H}} \le \lambda_{m_1},\ |b| \le C \lambda_{m_1},\ |\rho| \le C \lambda_{m_1}\}$$

with high probability. Suppose that the norm $\|f\|_\infty + |b| + |\rho|$ is introduced on $\mathcal{G}_{m_1}$. We define
the function

$$L(x, y; f, b, \rho) = -2\rho + \ell(\rho - y(f(x) + b)),$$

and the function set

$$\mathcal{L}_{m_1} = \{L(x, y; f, b, \rho) : (f, b, \rho) \in \mathcal{G}_{m_1}\}.$$

The supremum norm is defined on $\mathcal{L}_{m_1}$. The expected loss and the empirical loss, $\mathcal{R}(f + b, \rho)$ and
$\widehat{\mathcal{R}}_{T_1}(f + b, \rho)$, are represented as the expectation of $L(x, y; f, b, \rho)$ with respect to the population
distribution and the empirical distribution, respectively. Since $\ell : \mathbb{R} \to \mathbb{R}$ is a finite-valued
convex function, $\ell$ is locally Lipschitz continuous. Then, for any sample size $m_1$, there exists a
constant $\kappa_{m_1}$ depending on $m_1$ such that

$$|\ell(z) - \ell(z')| \le \kappa_{m_1} |z - z'| \tag{32}$$

holds for all $z$ and $z'$ satisfying $|z|, |z'| \le (K + 2C) \lambda_{m_1}$. Then, for any $(f, b, \rho), (f', b', \rho') \in \mathcal{G}_{m_1}$,
we have

$$|L(x, y; f, b, \rho) - L(x, y; f', b', \rho')| \le 2|\rho - \rho'| + \kappa_{m_1} (|\rho - \rho'| + |b - b'| + \|f - f'\|_\infty)$$
$$\le (2 + \kappa_{m_1})(|\rho - \rho'| + |b - b'| + \|f - f'\|_\infty).$$

The covering number of $\mathcal{L}_{m_1}$ is evaluated by using that of $\mathcal{G}_{m_1}$ as follows:

$$\mathcal{N}(\mathcal{L}_{m_1}, \varepsilon) \le \mathcal{N}\Big( \mathcal{G}_{m_1}, \frac{\varepsilon}{2 + \kappa_{m_1}} \Big). \tag{33}$$

Let the metric space be 

with the supremum norm, then, we also have 



An upper bound of the covering number of $\mathcal{F}_{m_1}$ is given by [10] and [31].



We prove the uniform convergence of TZ{f + b,p). The proof is deferred to Appendix iBl 
Lemma 5. Let bmi be 

in which C is the positive constant defined in Lemma Under Assumption U\ and the 
inequality 



p[ sup \n{f + b,p)-n{f + b,p)\>e 

V{/,fe,p)eg™i 

<2AA(£„,,e/3)exp|-^^[> (35) 



962 



^^^r-'9(2T^j [ e ) ^^H" 



(36) 

holds, where $\kappa_{m_1}$ is the Lipschitz constant defined by (32).

We present the main theorem of this section. The proof is given in Appendix C.

Theorem 2. Suppose that liuimi^oo = oo holds. Suppose that AssumptionUllM andlMhold. 
Moreover, we assume that (36) converges to zero for any $\varepsilon > 0$ when the sample size $m_1$ tends
to infinity. Then, $\mathcal{R}(\hat{f} + \tilde{b}, \hat{\rho})$ converges to $\mathcal{R}^*$ in probability in the large sample limit of the
dataset $T_1$.

We show the order of $\lambda_{m_1}$ admitting the assumption in Theorem 2.

Example 7. Suppose that $\mathcal{X} = [0, 1]^n \subset \mathbb{R}^n$ and the Gaussian kernel is used. According to fsA],
we have



V ^2 + Krm)J \\ 9(2+;—)/ / 



For any e > 0, (I36p is bounded above by 



exp |o(^ - ^ + (log(A^,A€^J)"+i 

For the truncated quadratic loss, we have

$$\kappa_{m_1} \le 2((K + 2C) \lambda_{m_1} + 1) = O(\lambda_{m_1}),$$
$$b_{m_1} \le 2C \lambda_{m_1} + ((K + 2C) \lambda_{m_1} + 1)^2 = O(\lambda_{m_1}^2).$$

Let us define $\lambda_{m_1} = m_1^\alpha$ with $0 < \alpha < 1/4$. Then, for any $\varepsilon > 0$, (36) converges to zero when
$m_1$ tends to infinity. In the same way, for the exponential loss we obtain

Hence, $\lambda_{m_1} = (\log m_1)^\alpha$ with $0 < \alpha < 1$ assures the convergence of (36).



6.3 Convergence to Bayes Risk 

We study the error rate of the estimated classifier. Let $\hat{f}$, $\tilde{b}$ and $\hat{\rho}$ be a minimizer of
$\widehat{\mathcal{R}}_{T_1, \lambda_{m_1}}(f + b, \rho)$. In the proposed learning algorithm in Figure 5, the estimated bias term $\tilde{b}$ is
replaced with $\hat{b}$, which is an optimal solution of $\min_{b \in \mathbb{R}} \widehat{\mathcal{E}}_{T_2}(\hat{f} + b)$. We prove that the expected
0-1 loss $\mathcal{E}(\hat{f} + \hat{b})$ converges to the Bayes risk $\mathcal{E}^*$ when the sample sizes of $T_1$ and $T_2$ tend to
infinity. The proof is shown in Appendix D.

Theorem 3. Suppose that $\mathcal{R}(\hat{f} + \tilde{b}, \hat{\rho})$ converges to $\mathcal{R}^*$ in probability when the sample size of
$T_1$, i.e., $m_1$, tends to infinity. For the RKHS $\mathcal{H}$ and the loss function $\ell$, we assume Assumption
d and[2 Then, £{f + b) converges to £* in probability, when the sample sizes of Ti and T2 
tend to infinity. 

As a result, we find that the prediction error rate of $\hat{f} + \hat{b}$ converges to the Bayes risk under
Assumptions 1, 2, 3 and 4.

We present some sufficient conditions for existence of the function $\bar{\psi}$ in Assumption 4. The
proof of the following lemma is shown in Appendix E.

Lemma 6. Suppose that the first condition in Assumption 3 and the first condition in
Assumption 4 hold. In addition, suppose that $\ell$ is first-order continuously differentiable on $\mathbb{R}$. Let $d$ be
$d = \sup\{z \in \mathbb{R} : \ell'(z) = 0\}$, where $\ell'$ is the derivative of $\ell$. When $\ell'(z) > 0$ holds for all $z \in \mathbb{R}$,
we define $d = -\infty$. We assume the following conditions:

1. $d < -\ell(0)/2$.

2. $\ell(z)$ is second-order continuously differentiable on the open interval $(d, \infty)$.

3. $\ell''(z) > 0$ holds on $(d, \infty)$.

4. $1/\ell'(z)$ is convex on $(d, \infty)$.

Then, for any $\theta \in [0, 1]$, the function $\psi(\theta, \rho)$ is non-decreasing as a function of $\rho$ for $\rho \ge -\ell(0)/2$.

When the condition in Lemma 6 is satisfied, we can choose $\psi(\theta, -\ell(0)/2)$ as $\bar{\psi}(\theta)$ for
$0 \le \theta \le 1$, since $\psi(\theta, -\ell(0)/2)$ is classification-calibrated under the first condition in Assumption 4.

We give another sufficient condition for existence of the function $\bar{\psi}$ in Assumption 4. The
proof of the following lemma is shown in Appendix E.

Lemma 7. Suppose that the first condition in Assumption 3 and the first condition in
Assumption 4 hold. Let $d$ be $d = \sup\{z \in \mathbb{R} : \partial\ell(z) = \{0\}\}$. When $\partial\ell(z) \ne \{0\}$ holds for all $z \in \mathbb{R}$, we
define $d = -\infty$. Suppose that the inequality $-\ell(0)/2 > d$ holds. For $\rho \ge -\ell(0)/2$ and $z \ge 0$, we
define $\xi(z, \rho)$ by

$$\xi(z, \rho) = \begin{cases} \dfrac{\ell(\rho + z) + \ell(\rho - z) - 2\ell(\rho)}{z\, \ell'(\rho)}, & z > 0, \\ 0, & z = 0. \end{cases}$$

Suppose that there exists a function $\bar{\xi}(z)$ for $z \ge 0$ such that the following conditions hold:

1. $\bar{\xi}(z)$ is continuous and strictly increasing on $z \ge 0$, and satisfies $\bar{\xi}(0) = 0$ and
$\lim_{z \to \infty} \bar{\xi}(z) > 1$.

2. $\sup_{\rho \ge -\ell(0)/2} \xi(z, \rho) \le \bar{\xi}(z)$ holds.

Then, there exists a function $\bar{\psi}$ defined in the second condition of Assumption 4.

Note that Lemma 7 does not require the second-order differentiability of the loss function.
We show some examples in which the existence of $\bar{\psi}$ is confirmed from the above lemmas.

Example 8. For the truncated quadratic loss $\ell(z) = (\max\{z + 1, 0\})^2$, the first condition in
Assumption 3 and the first condition in Assumption 4 hold. The inequality $-\ell(0)/2 = -1/2 >
\sup\{z : \ell'(z) = 0\} = -1$ in the sufficient condition of Lemma 6 holds. For $z > -1$, it is
easy to see that $\ell(z)$ is second-order differentiable and that $\ell''(z) > 0$ holds. In addition, for
$z > -1$, $1/\ell'(z)$ is equal to $1/(2z + 2)$, which is convex on $(-1, \infty)$. Therefore, the function
$\bar{\psi}(\theta) = \psi(\theta, -1/2)$ satisfies the second condition in Assumption 4.

Example 9. For the exponential loss $\ell(z) = e^z$, we have $1/\ell'(z) = e^{-z}$. Hence, due to Lemma
6, $\psi(\theta, \rho)$ is non-decreasing in $\rho$. Indeed, we have $\psi(\theta, \rho) = (1 - \sqrt{1 - \theta^2})\, e^\rho$.
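
The closed form for the exponential loss can be verified numerically. The sketch below assumes the definition $\psi(\theta, \rho) = \ell(\rho) - \inf_z \{ \tfrac{1+\theta}{2} \ell(\rho - z) + \tfrac{1-\theta}{2} \ell(\rho + z) \}$ and approximates the infimum by a ternary search, which is valid here because the objective is convex in $z$; the search interval and the name `psi` are assumptions made for the illustration.

```python
import math

def psi(theta, rho, loss, lo=-30.0, hi=30.0, iters=200):
    """Approximate psi(theta, rho) for a convex loss by ternary search over z."""
    def g(z):
        return 0.5 * (1 + theta) * loss(rho - z) + 0.5 * (1 - theta) * loss(rho + z)
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if g(a) < g(b):
            hi = b  # the minimiser lies to the left of b
        else:
            lo = a  # the minimiser lies to the right of a
    return loss(rho) - g(0.5 * (lo + hi))

def psi_exponential(theta, rho):
    """Closed form for the exponential loss."""
    return (1 - math.sqrt(1 - theta ** 2)) * math.exp(rho)

# Compare both versions on a small grid (theta < 1 keeps the minimiser finite).
pairs = [(psi(t, r, math.exp), psi_exponential(t, r))
         for t in (0.0, 0.3, 0.9) for r in (-0.5, 0.0, 1.0)]
```

Since the closed form is proportional to $e^\rho$, it is increasing in $\rho$, which matches the monotonicity stated in the example.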

Figure 6: The derivative of the loss function corresponding to the revised uncertainty set with
the estimation error.

Example 10. In Example 6, we presented the uncertainty set with estimation errors. The
uncertainty sets are defined based on the revised function $\bar{\ell}^*(\alpha)$ in (22). Here, we use a similar
function defined by

$$\ell^*(\alpha) = \begin{cases} (|\alpha w - 1| + h)^2 - (1 + h)^2, & \alpha \ge 0, \\ \infty, & \alpha < 0, \end{cases} \tag{37}$$

for the construction of uncertainty sets. Here, $w$ and $h$ are positive constants, and we suppose
$w > 1/2$. The corresponding loss function is denoted as $\ell(z)$. Then we have $\ell(z) = u(z/w)$ for $u$ defined
in (23). For $w > 1/2$, we can confirm that $\sup\{z : \ell'(z) = 0\} < -\ell(0)/2$ holds. Since $u(z)$ is
not strictly convex, Lemma 6 does not work. Hence, we apply Lemma 7. A simple calculation
yields that $\ell'(-\ell(0)/2) \ge (4w - 1)/(4w^2) > 0$ for any $h \ge 0$. Note that $\ell(z)$ is differentiable on
$\mathbb{R}$. Thus, the monotonicity of $\ell'$ for the convex function leads to

$$\xi(z, \rho) = \frac{1}{\ell'(\rho)} \Big( \frac{\ell(\rho + z) - \ell(\rho)}{z} - \frac{\ell(\rho) - \ell(\rho - z)}{z} \Big) \le \frac{\ell'(\rho + z) - \ell'(\rho - z)}{\ell'(\rho)}.$$

Figure 6 depicts the derivative of $\ell$ with $h = 1$ and $w = 1$. Since the derivative $\ell'(z)$ is Lipschitz
continuous and the Lipschitz constant is equal to $1/(2w)$, we have $\ell'(\rho + z) - \ell'(\rho - z) \le z/w$.
Therefore, the inequality

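$$\xi(z, \rho) \le \frac{z/w}{\ell'(\rho)} \le \frac{z/w}{\ell'(-\ell(0)/2)} \le \frac{4w}{4w - 1}\, z$$

holds. We see that $\bar{\xi}(z) = 2z$ satisfies the sufficient condition of Lemma 7. The inequality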

ensures that ^{9) = ^^^^ is a valid choice. Therefore, the loss function corresponding to 
the revised uncertainty set in Example 6 satisfies the sufficient conditions for the Bayes risk
consistency.






7 Experiments 



We compare the statistical properties of the proposed learning algorithm with those of other learning
methods. As proved in Section 6, the kernel-based learning algorithm in Figure 5 is
statistically consistent under some assumptions, while MPM and MM-MPM are not
statistically consistent in general. The main purpose of the numerical study is to compare our
method with MPM and its variants.

We compare the kernel-based learning algorithms using the Gaussian kernel. So far, many
works have been devoted to comparing linear models and kernel-based models. The
conclusion is that the linear model outperforms the kernel-based model when the decision boundary
is well approximated by the linear model. Otherwise, the linear model has an approximation
bias, and kernel-based estimators with suitable regularization outperform the linear models
in general. Hence, we focus on the kernel-based estimators. In our experiments, the following
methods were applied to synthetic data and standard benchmark datasets: C-SVM,
MPM, unbiased MPM, and the kernel variant of the proposed method presented in Figure 5.
For simplicity, the function part $f \in \mathcal{H}$ and the bias term $b \in \mathbb{R}$ are estimated based on all
training samples, though in the learning algorithm in Figure 5 the dataset is decomposed into two
subsets in order to ensure the statistical consistency. In the unbiased MPM, the bias term $b$ in
the model is estimated by minimizing the training error rate after estimating the function part,
$f \in \mathcal{H}$. The unbiased estimator is expected to outperform the original MPM when the probability
of the class label is heavily unbalanced. In the proposed method, we apply the uncertainty set
defined from the loss function $u(z)$ defined in (23). This is the revised uncertainty set of the
ellipsoidal uncertainty set with the estimation error. The parameter $h$ in the function $u(z)$ of (23)
is set to $h = 0$ or $h = 1$. The kernel parameter and the regularization parameter are estimated
by 5-fold cross validation. We use the test error for the evaluation of the prediction accuracy.

7.1 Synthetic data 

Suppose that the input points $x$ conditioned on the positive label are generated by the
two-dimensional normal distribution with the mean $\mu_p = (0, 0)^T$ and the covariance matrix $\Sigma_p = I$,
where $I$ is the identity matrix. In the same way, the conditional distribution of input points with
the negative label is defined as the normal distribution with the mean $\mu_n = (1, 1)^T$ and the covariance
matrix $\Sigma_n = R^T \mathrm{diag}(0.5^2, 1.5^2) R$, where $R$ is the $\pi/3$ radian counterclockwise rotation matrix.
The label probability is defined by $P(Y = +1) = 0.2$ or $0.5$. The size of training samples is
$m = 400$.
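
The data-generating process described above can be sketched with the Python standard library alone. This is an illustrative reconstruction, not the authors' code: it assumes that labels are drawn independently with probability $P(Y = +1)$ (rather than fixing the class counts), and the names `rotation`, `sample_point` and `make_dataset` are hypothetical. A negative-class point is drawn as $\mu_n + R^T \mathrm{diag}(0.5, 1.5)\, g$ with standard normal $g$, which has covariance $R^T \mathrm{diag}(0.5^2, 1.5^2) R$.

```python
import math
import random

def rotation(theta):
    """Counterclockwise rotation matrix by theta radians, as a list of rows."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def sample_point(label, rng, R):
    """Draw one input point conditioned on the label (+1 or -1)."""
    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    if label == 1:
        return (g1, g2)  # N((0, 0), I)
    v1, v2 = 0.5 * g1, 1.5 * g2  # scale by the standard deviations
    # Apply R^T to (v1, v2) and shift by the mean (1, 1).
    return (1.0 + R[0][0] * v1 + R[1][0] * v2,
            1.0 + R[0][1] * v1 + R[1][1] * v2)

def make_dataset(m, p_pos, seed=0):
    """Return a list of ((x1, x2), y) pairs with P(y = +1) = p_pos."""
    rng = random.Random(seed)
    R = rotation(math.pi / 3)
    data = []
    for _ in range(m):
        y = 1 if rng.random() < p_pos else -1
        data.append((sample_point(y, rng, R), y))
    return data

train = make_dataset(400, 0.2)
```

Calling `make_dataset(400, 0.5)` gives the balanced setup; the seed is fixed only to make the sketch reproducible.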

Table 1 shows the test errors of the estimators: C-SVM, MPM, unbiased MPM, and learning with
the loss function (23) with $h = 0$ or $h = 1$. We notice that, for unbalanced samples, i.e.,
the case of $P(Y = +1) = 0.2$, MPM has an estimation bias. In the balanced
setup, MPM is slightly better than the other methods. All the learning algorithms except MPM
are comparable to each other. The difference in the parameter $h$ in the loss function (23) is not
significant in this experiment.

7.2 Benchmark data 

Table 1: Test error (%) of each learning method is presented with the standard deviation. We
compared C-SVM, MPM, unbiased MPM, and the learning method with the loss function (23) with
$h = 0$ and $h = 1$.

P(Y=+1)   C-SVM        MPM          unbiased MPM   h = 0        h = 1
0.2       15.8 ± 1.1   26.0 ± 2.2   16.5 ± 1.2     15.9 ± 1.1   16.0 ± 1.2
0.5       25.2 ± 1.1   25.1 ± 1.0   25.5 ± 1.3     25.5 ± 1.4   25.4 ± 1.1

In this section, we use thirteen artificial and real world datasets from the UCI, DELVE, and
STATLOG benchmark repositories: banana, breast-cancer, diabetes, german, heart, image,
ringnorm, flare-solar, splice, thyroid, titanic, twonorm, waveform. All datasets are
provided in the IDA benchmark repository. See [20] and [19] for details of the datasets. The properties
of each dataset are shown in Table 2, where "dim", "P(y = +1)", "#train", "#test" and
"rep." denote the input dimension, the ratio of the positive labels in training samples, the size
of training set, the size of test set, and the number of replications of learning to evaluate the
average performance, respectively.

In the experiment, we compare in particular unbiased MPM and our method using the loss
function (23) with $h = 0$. The uncertainty set of unbiased MPM is the ellipsoid defined by the
estimated covariance matrix. The corresponding loss function of the form of (8) does not exist,
since the convex hull of the input points is not taken into account. In our method using the
loss function (23) with $h = 0$, the uncertainty set is the intersection of the same ellipsoid as in
unbiased MPM and the convex hull of the input vectors. That is, the revision of the ellipsoidal
uncertainty set in unbiased MPM leads to the uncertainty set of our algorithm. We use the
t-test to detect the difference between the test errors of these two learning algorithms.

Table 3 shows test errors (%) for benchmark datasets with the standard deviation. We show 
the results of C-SVM, MPM, unbiased MPM, and the learning method with the loss function (23) 
with h = 0 and h = 1. In the columns of unbiased MPM and our method with h = 0, bold 
face indicates that the test error is smaller than that of the opponent at the significance 
level 1%. Overall, C-SVM performs better than the others, and the learning method with the loss 
function (23) with h = 1 is comparable to C-SVM except on breast-cancer, flare-solar and 
titanic. Note that the loss function (23) with h = 1 is similar to the hinge loss around zero. 
Hence, it is natural that the results of our method with h = 1 are close to those of C-SVM. The 
results of the t-test indicate that, compared to unbiased MPM, our method using the loss function 
(23) with h = 0 achieves smaller test errors on most datasets. In both algorithms, the same 
estimator is used for the bias term in the decision function. Hence, the result implies that our 
method is superior to unbiased MPM in the estimation of the function part $f \in \mathcal{H}$ of 
the decision function. On the datasets flare-solar and titanic, unbiased MPM is superior to our 
method with h = 0. This is because there are many duplications in the covariates of these 
datasets. Indeed, among the 666 training samples of flare-solar there are only 76 different input 
points, and titanic has only 11 different input points out of 150 training samples. In the other 
datasets, the number of distinct covariates is almost equal to the number of training samples. In 
our method, the uncertainty set for such duplicated data does not capture the distribution of the 
input points appropriately. We conclude that the revision of the uncertainty set is useful for 
achieving higher prediction accuracy than (unbiased) MPM, as long as the covariates do not 
contain many duplications. 
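
The degree of duplication in the covariates can be checked before choosing the learning method. The following is a minimal sketch with toy inputs; the function name and the data are hypothetical and only illustrate counting distinct input points.

```python
def count_distinct_inputs(inputs):
    """Return (number of samples, number of distinct input vectors)."""
    distinct = {tuple(x) for x in inputs}
    return len(inputs), len(distinct)


# Toy inputs (hypothetical): six samples, only three distinct points
toy_inputs = [[0, 1], [0, 1], [1, 1], [1, 1], [1, 1], [2, 0]]
n_samples, n_distinct = count_distinct_inputs(toy_inputs)
print(n_samples, n_distinct, n_distinct / n_samples)
```

A ratio close to one corresponds to the situation of most benchmark datasets above, whereas a small ratio corresponds to the situation described for flare-solar and titanic.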

Table 2: Properties of the datasets, where "dim", "P(y = +1)", "#train", "#test" and "rep." 
denote the input dimension, the ratio of positive labels in the training samples, the size of the 
training set, the size of the test set, and the number of replications of learning, respectively. 

dataset          dim   P(y = +1)   #train   #test   rep. 
banana             2   0.4?4          400    4900    100 
breast-cancer      9   0.294          200      77    100 
diabetis           8   0.350          468     300    100 
flare-solar        9   0.552          666     400    100 
german            20   0.301          700     300    100 
heart             13   0.445          170     100    100 
image             18   0.574         1300    1010     20 
ringnorm          20   0.497          400    7000    100 
splice            60   0.483         1000    2175     20 
thyroid            5   0.305          140      75     85 
titanic            3   0.322          150    2051    100 
twonorm           20   0.505          400    7000    100 
waveform          21   0.331          400    4600    100 

Table 3: Test errors (%) for benchmark datasets are presented with the standard deviation. 
We compared C-SVM, MPM, unbiased MPM, and the learning method with the loss function (23) 
with h = 0 and h = 1. We conduct the t-test to compare unbiased MPM and the learning method 
using the loss function (23) with h = 0. Bold face indicates that the test error is 
smaller than that of the opponent at the significance level 1%. 

dataset          C-SVM        MPM          unbiased MPM   h = 0        h = 1 
banana           10.7 ± 0.6   11.4 ± 0.9   11.4 ± 0.9     11.1 ± 0.9   10.9 ± 0.7 
breast-cancer    26.9 ± 4.8   35.0 ± 4.9   34.0 ± 4.8     28.1 ± 5.0   28.1 ± 4.5 
diabetis         23.9 ± 2.1   28.8 ± 2.4   28.3 ± 2.5     24.3 ± 1.9   24.2 ± 2.1 
flare-solar      33.7 ± 2.2   34.9 ± 1.7   35.7 ± 1.9     36.8 ± 3.1   36.8 ± 2.9 
german           23.8 ± 2.3   29.2 ± 2.4   28.2 ± 2.7     23.5 ± 2.3   23.6 ± 2.4 
heart            16.7 ± 3.5   25.6 ± 4.2   25.7 ± 4.0     17.3 ± 3.7   17.2 ± 3.5 
image             3.3 ± 0.7    3.2 ± 0.7    3.2 ± 0.7      3.4 ± 0.6    3.3 ± 0.5 
ringnorm          1.7 ± 0.3    3.2 ± 0.4    2.8 ± 0.5      1.7 ± 0.3    1.6 ± 0.2 
splice           11.1 ± 0.7   12.3 ± 1.7   11.7 ± 0.8     11.3 ± 0.7   11.1 ± 0.8 
thyroid           5.3 ± 2.1    6.3 ± 3.1    6.2 ± 3.7      5.6 ± 2.4    5.4 ± 2.2 
titanic          22.4 ± 0.8   24.1 ± 2.2   22.4 ± 1.2     23.5 ± 1.6   23.7 ± 3.4 
twonorm           2.6 ± 0.3    4.5 ± 0.7    4.4 ± 0.6      2.6 ± 0.3    2.6 ± 0.4 
waveform         10.2 ± 0.7   13.0 ± 0.9   12.7 ± 0.8     10.2 ± 0.6   10.1 ± 0.7 

8 Conclusion 

In this paper, we studied the relation between the loss function approach and the uncertainty 
set approach in binary classification problems. We showed that these two approaches are 
connected to each other by the conjugate property based on the Legendre transformation. Given 
a loss function, there exists a corresponding parametrized uncertainty set. In general, however, 
an uncertainty set does not correspond to an empirical loss function. We presented a way of 
revising the uncertainty set such that a corresponding empirical loss function exists. Then, we 
proposed a modified maximum-margin algorithm based on the parametrized uncertainty set. We 
proved the statistical consistency of the learning algorithm. Numerical experiments showed that 
the revision of the uncertainty set often improves the prediction accuracy of the classifier. 

In our proof of the statistical consistency, the hinge loss used in $\nu$-SVM is excluded. [24] 
proved the statistical consistency of $\nu$-SVM with a suitable choice of the regularization 
parameter. We are currently investigating the relaxation of the assumptions of our theoretical 
result so as to include the hinge loss function and other popular loss functions such as the 
logistic loss. As for statistical modeling, the relation between the loss function approach and 
the uncertainty set approach can be a useful tool. In optimization and control theory, modeling 
based on the uncertainty set is frequently applied to real-world data; see the modeling in robust 
optimization and related works. We believe that the learning algorithm with the revision of 
the uncertainty set can bridge the gap between statistical modeling based on intuition and 
desirable statistical properties of the estimated classifiers. 

A Proof of Lemma 1 

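First, we prove the existence of an optimal solution. According to the standard argument on 
the kernel estimator, we can restrict the function part $f$ to be of the form 

$$f(x) = \sum_{j=1}^{m_1} \alpha_j\, k(x, x_j^{(1)}).$$

Then, the problem is reduced to the finite-dimensional problem, 

$$\begin{aligned}
\min_{\alpha, b, \rho}\quad & -2\rho + \frac{1}{m_1}\sum_{i=1}^{m_1} \ell\Big(\rho - y_i^{(1)}\Big(\sum_{j=1}^{m_1} \alpha_j k(x_i^{(1)}, x_j^{(1)}) + b\Big)\Big) \\
\text{subject to}\quad & \sum_{i,j=1}^{m_1} \alpha_i \alpha_j k(x_i^{(1)}, x_j^{(1)}) \le \lambda^2.
\end{aligned} \tag{38}$$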

Let $\zeta_0(\alpha, b, \rho)$ be the objective function of (38). Let $S$ be the linear subspace 
of $\mathbb{R}^{m_1}$ spanned by the column vectors of the Gram matrix 
$(k(x_i^{(1)}, x_j^{(1)}))_{i,j=1}^{m_1}$. We can impose the constraint 
$\alpha = (\alpha_1, \ldots, \alpha_{m_1}) \in S$, since the orthogonal complement of $S$ does not 
affect the objective and the constraint in (38). We see that Assumption 1 and the reproducing 
property yield the inequality $\|\sum_{j=1}^{m_1} \alpha_j k(\cdot, x_j^{(1)})\|_\infty \le K\lambda$. 
Due to this inequality and the assumptions on the function $\ell$, the objective function 
$\zeta_0(\alpha, b, \rho)$ is bounded below by 

$$\zeta_1(b, \rho) = -2\rho + \frac{m_p}{m_1}\,\ell(\rho - b - K\lambda) + \frac{m_n}{m_1}\,\ell(\rho + b - K\lambda). \tag{39}$$
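Hence, for any real number $c$, the inclusion relation 

$$\begin{aligned}
&\Big\{(\alpha, b, \rho) : \zeta_0(\alpha, b, \rho) \le c,\ \sum_{i,j=1}^{m_1} \alpha_i\alpha_j k(x_i^{(1)}, x_j^{(1)}) \le \lambda^2,\ \alpha \in S\Big\} \\
&\quad\subset \Big\{(\alpha, b, \rho) : \zeta_1(b, \rho) \le c,\ \sum_{i,j=1}^{m_1} \alpha_i\alpha_j k(x_i^{(1)}, x_j^{(1)}) \le \lambda^2,\ \alpha \in S\Big\}
\end{aligned} \tag{40}$$

holds. Note that the vector $\alpha$ satisfying 
$\sum_{i,j=1}^{m_1} \alpha_i\alpha_j k(x_i^{(1)}, x_j^{(1)}) \le \lambda^2$ and $\alpha \in S$ is 
restricted to a compact subset of $\mathbb{R}^{m_1}$. We shall prove that the subsets in (40) are 
compact if they are not empty. We see that the two sets above are closed, since both $\zeta_0$ 
and $\zeta_1$ are continuous. By the variable change from $(b, \rho)$ to 
$(u_1, u_2) = (\rho - b, \rho + b)$, $\zeta_1(b, \rho)$ is transformed to the convex function 
$\zeta_2(u_1, u_2)$ defined by 

$$\zeta_2(u_1, u_2) = -u_1 + \frac{m_p}{m_1}\,\ell(u_1 - K\lambda) - u_2 + \frac{m_n}{m_1}\,\ell(u_2 - K\lambda).$$

The subgradient of $\ell(z)$ diverges to infinity when $z$ tends to infinity. In addition, 
$\ell(z)$ is a non-decreasing and non-negative function. Then, we have 

$$\lim_{|u_1| \to \infty} -u_1 + \frac{m_p}{m_1}\,\ell(u_1 - K\lambda) = \infty.$$

The same limit holds for $-u_2 + \frac{m_n}{m_1}\ell(u_2 - K\lambda)$. Hence, the level set of 
$\zeta_2(u_1, u_2)$ is closed and bounded, i.e., compact. As a result, the level set of 
$\zeta_1(b, \rho)$ is also compact. Therefore, the subset (40) is also compact in 
$\mathbb{R}^{m_1 + 2}$. This implies that (38) has an optimal solution. 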
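
The divergence of each summand in both directions, which is what makes the level sets bounded, can be checked numerically. The sketch below assumes the exponential loss as a stand-in for $\ell$ (non-negative, non-decreasing, with a derivative that diverges); this loss and the constants are illustrative choices, not the ones analysed in the text.

```python
import math


def loss(z):
    # Hypothetical loss: non-negative, non-decreasing, derivative diverges as z -> infinity
    return math.exp(z)


def summand(u, weight, shift):
    """One summand of the transformed objective: -u + weight * loss(u - shift)."""
    return -u + weight * loss(u - shift)


weight, shift = 0.4, 1.5  # stand-ins for m_p / m_1 and K * lambda (assumed values)
grid = [i / 100.0 for i in range(-2000, 2001)]  # u in [-20, 20]
values = [summand(u, weight, shift) for u in grid]
u_min = grid[values.index(min(values))]

# For the exponential loss the minimiser has the closed form shift - log(weight).
u_star = shift - math.log(weight)
print(u_min, u_star, summand(-20.0, weight, shift), summand(20.0, weight, shift))
```

The function is large at both ends of the grid and attains its minimum at a finite point, which is the one-dimensional picture behind the compactness of the level sets.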

Next, we prove the duality between (29) and (24). Since (38) has an optimal solution, the 
problem with the slack variables $\xi_i$, $i = 1, \ldots, m_1$, 

$$\begin{aligned}
\min_{\alpha, b, \rho, \xi}\quad & -2\rho + \frac{1}{m_1}\sum_{i=1}^{m_1} \ell(\xi_i) \\
\text{subject to}\quad & \sum_{i,j=1}^{m_1} \alpha_i\alpha_j k(x_i^{(1)}, x_j^{(1)}) \le \lambda^2, \\
& \rho - y_i^{(1)}\Big(\sum_{j=1}^{m_1} \alpha_j k(x_i^{(1)}, x_j^{(1)}) + b\Big) \le \xi_i, \quad i = 1, \ldots, m_1,
\end{aligned}$$

also has an optimal solution and a finite optimal value. In addition, the above problem 
clearly satisfies the Slater condition [3, Assumption 6.2.4]. Indeed, at the feasible solution 
$\alpha = 0$, $b = 0$, $\rho = 0$ and $\xi_i = 1$, $i = 1, \ldots, m_1$, the constraint 
inequalities are all inactive for positive $\lambda$. Hence, Proposition 6.4.3 in [3] ensures that 
the min-max theorem holds, i.e., there is no duality gap. Then, in the same way as [?], we obtain 
(24) with the uncertainty set (27) as the dual problem of (29). 



B Proofs of Lemmas in Section 6.2 

We show proofs of the lemmas in Section 6.2. 

B.1 Proof of Lemma 2 

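Let $S \subset \mathcal{X}$ be the subset $S = \{x \in \mathcal{X} : \varepsilon < P(+1|x) < 1 - \varepsilon\}$; 
then we have $P(S) > 0$. Due to the non-negativity of the loss function $\ell$, we have 

$$\begin{aligned}
\mathcal{R}(f, \rho) &\ge -2\rho + \int_S \big\{P(+1|x)\,\ell(\rho - f(x)) + P(-1|x)\,\ell(\rho + f(x))\big\}\, P(dx) \\
&= \int_S \Big\{-\frac{2}{P(S)}\rho + P(+1|x)\,\ell(\rho - f(x)) + P(-1|x)\,\ell(\rho + f(x))\Big\}\, P(dx).
\end{aligned}$$

For given $\eta$ satisfying $\varepsilon < \eta < 1 - \varepsilon$, we define the function $\xi(f, \rho)$ by 

$$\xi(f, \rho) = -\frac{2}{P(S)}\rho + \eta\,\ell(\rho - f) + (1 - \eta)\,\ell(\rho + f), \qquad f, \rho \in \mathbb{R}.$$

We derive a lower bound of $\inf\{\xi(f, \rho) : f, \rho \in \mathbb{R}\}$. Since $\ell(z)$ is a 
finite-valued convex function on $\mathbb{R}$, the subdifferential 
$\partial\xi(f, \rho) \subset \mathbb{R}^2$ is given as 

$$\partial\xi(f, \rho) = \Big\{\Big(0, -\frac{2}{P(S)}\Big)^T + \eta u\,(-1, 1)^T + (1 - \eta) v\,(1, 1)^T : u \in \partial\ell(\rho - f),\ v \in \partial\ell(\rho + f)\Big\}.$$

Formulas of the subdifferential are presented in Theorem 23.8 and Theorem 23.9 of [21]. We 
prove that there exist $f^*$ and $\rho^*$ such that $(0, 0)^T \in \partial\xi(f^*, \rho^*)$ holds. 
Since the second condition in Assumption 3 holds for the convex function $\ell$, the union 
$\cup_{z \in \mathbb{R}} \partial\ell(z)$ includes all the positive real numbers. Hence, there exist 
$z_1$ and $z_2$ satisfying $\frac{1}{\eta P(S)} \in \partial\ell(z_1)$ and 
$\frac{1}{(1 - \eta) P(S)} \in \partial\ell(z_2)$. Then, for $f^* = (z_2 - z_1)/2$ and 
$\rho^* = (z_1 + z_2)/2$, the null vector is an element of $\partial\xi(f^*, \rho^*)$. Since 
$\xi(f, \rho)$ is convex in $(f, \rho)$, the minimum value of $\xi(f, \rho)$ is attained at 
$(f^*, \rho^*)$. Define $z_{up}$ as a real number satisfying 

$$\frac{1}{\varepsilon P(S)} \in \partial\ell(z_{up}).$$

Since $\varepsilon < \eta < 1 - \varepsilon$ is assumed, both $z_1$ and $z_2$ are less than $z_{up}$ 
due to the monotonicity of the subdifferential. Then, the inequality 

$$\xi(f, \rho) \ge -\frac{z_1 + z_2}{P(S)} + \eta\,\ell(z_1) + (1 - \eta)\,\ell(z_2) \ge -\frac{2 z_{up}}{P(S)}$$

holds for all $f, \rho \in \mathbb{R}$ and all $\eta$ such that $\varepsilon < \eta < 1 - \varepsilon$. 
Hence, for any measurable function $f \in L_0$ and $\rho \in \mathbb{R}$, we have 

$$\mathcal{R}(f, \rho) \ge \int_S -\frac{2 z_{up}}{P(S)}\, P(dx) \ge -2 z_{up}.$$

As a result, we have $\mathcal{R}^* \ge -2 z_{up} > -\infty$. 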
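
The construction of the minimiser $(f^*, \rho^*)$ from $z_1$ and $z_2$ can be verified numerically for a concrete loss. The sketch below assumes the exponential loss, for which $\partial\ell(z) = \{e^z\}$ and $z_1$, $z_2$ have closed forms; the loss and the values of $\eta$ and $P(S)$ are illustrative assumptions.

```python
import math


def xi(f, rho, eta, p_s):
    """xi(f, rho) = -(2 / P(S)) rho + eta l(rho - f) + (1 - eta) l(rho + f), with l = exp (assumed)."""
    return -2.0 * rho / p_s + eta * math.exp(rho - f) + (1.0 - eta) * math.exp(rho + f)


def stationary_point(eta, p_s):
    """Solve l'(z1) = 1 / (eta P(S)) and l'(z2) = 1 / ((1 - eta) P(S)) for l' = exp."""
    z1 = -math.log(eta * p_s)
    z2 = -math.log((1.0 - eta) * p_s)
    return (z2 - z1) / 2.0, (z1 + z2) / 2.0  # (f*, rho*)


eta, p_s = 0.3, 0.8  # assumed values of eta and P(S)
f_star, rho_star = stationary_point(eta, p_s)
best = xi(f_star, rho_star, eta, p_s)

# Compare with a coarse grid search centred at the candidate minimiser.
grid_min = min(
    xi(f_star + i / 50.0, rho_star + j / 50.0, eta, p_s)
    for i in range(-100, 101)
    for j in range(-100, 101)
)
print(f_star, rho_star, best, grid_min)
```

No grid point improves on the value at $(f^*, \rho^*)$, in agreement with the convexity argument.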



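B.2 Proof of Lemma 3 

Corollary 5.29 of [27] ensures that the equality 

$$\inf\{\mathbb{E}[\ell(\rho - y f(x))] : f \in \mathcal{H}\} = \inf\{\mathbb{E}[\ell(\rho - y f(x))] : f \in L_0\}$$

holds for any $\rho \in \mathbb{R}$. Thus, we have 
$\inf\{\mathcal{R}(f, \rho) : f \in \mathcal{H}\} = \inf\{\mathcal{R}(f, \rho) : f \in L_0\}$ for any 
$\rho \in \mathbb{R}$. Then, the equality 

$$\inf\{\mathcal{R}(f, \rho) : f \in \mathcal{H},\ \rho \in \mathbb{R}\} = \mathcal{R}^*$$

holds. Under Assumption 2 and Assumption 3, we have $\mathcal{R}^* > -\infty$ due to Lemma 2. 
Then, for any $\varepsilon > 0$, there exist $\lambda_\varepsilon > 0$, $f_\varepsilon \in \mathcal{H}$ 
and $\rho_\varepsilon \in \mathbb{R}$ such that $\|f_\varepsilon\|_{\mathcal{H}} \le \lambda_\varepsilon$ 
and $\mathcal{R}(f_\varepsilon, \rho_\varepsilon) \le \mathcal{R}^* + \varepsilon$ hold. For all 
$\lambda \ge \lambda_\varepsilon$ we have 

$$\inf\{\mathcal{R}_\lambda(f, \rho) : f \in \mathcal{H},\ \rho \in \mathbb{R}\} \le \mathcal{R}_\lambda(f_\varepsilon, \rho_\varepsilon) = \mathcal{R}(f_\varepsilon, \rho_\varepsilon) \le \mathcal{R}^* + \varepsilon.$$

On the other hand, it is clear that the inequality 
$\mathcal{R}^* \le \inf\{\mathcal{R}_\lambda(f, \rho) : f \in \mathcal{H},\ \rho \in \mathbb{R}\}$ holds. 
Hence, the claim of Lemma 3 holds. 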

B.3 Proof of Lemma 4 

Under Assumption 2, the label probabilities, $P(y = +1)$ and $P(y = -1)$, are positive. We 
assume that the inequalities 

$$\frac{P(y = +1)}{2} < \frac{m_p}{m_1}, \qquad \frac{P(y = -1)}{2} < \frac{m_n}{m_1} \tag{41}$$

hold. Applying the Chernoff bound, we see that there exists a positive constant $c > 0$ depending 
only on the marginal probability of the label such that (41) holds with probability higher 
than $1 - e^{-c m_1}$. 
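
The exponential decay of the failure probability of (41) can be made concrete. The sketch below is an illustration under assumptions: it uses Hoeffding's inequality, $P(m_p/m_1 \le p - t) \le e^{-2 m_1 t^2}$ with $t = p/2$, as a simple explicit instance of such a bound rather than the Chernoff bound used in the proof, and compares it with the exact binomial tail for one label. The value of $p = P(y = +1)$ is hypothetical.

```python
import math


def exact_failure_prob(m, p):
    """P(m_p / m <= p / 2) for m_p ~ Binomial(m, p), i.e. the first inequality in (41) fails."""
    k_max = math.floor(m * p / 2.0)
    return sum(math.comb(m, k) * p ** k * (1.0 - p) ** (m - k) for k in range(k_max + 1))


def hoeffding_bound(m, p):
    """Hoeffding bound exp(-2 m t^2) with deviation t = p / 2."""
    return math.exp(-m * p ** 2 / 2.0)


p_pos = 0.3  # assumed value of P(y = +1)
for m in (50, 100, 200):
    print(m, exact_failure_prob(m, p_pos), hoeffding_bound(m, p_pos))
```

Both columns decrease geometrically in the sample size, and the constant in the exponent depends only on the label probability, as stated above.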

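Lemma 1 ensures that the problem (29) has optimal solutions $\hat f, \hat b, \hat\rho$. The first 
inequality in (31), i.e., $\|\hat f\|_{\mathcal{H}} \le \lambda_{m_1}$, is clearly satisfied. Then, 
we have $\|\hat f\|_\infty \le K\lambda_{m_1}$ from the reproducing property of the RKHS. The 
definition of the estimator and the non-negativity of $\ell$ yield that 

$$-2\hat\rho \le -2\hat\rho + \frac{1}{m_1}\sum_{i=1}^{m_1} \ell\big(\hat\rho - y_i^{(1)}(\hat f(x_i^{(1)}) + \hat b)\big) \le \widehat{\mathcal{R}}_{T_1, \lambda_{m_1}}(0, 0) = \ell(0).$$

Then, we have 

$$\hat\rho \ge -\frac{\ell(0)}{2}. \tag{42}$$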



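Next, we consider the optimality condition of $\widehat{\mathcal{R}}_{T_1, \lambda_{m_1}}$. According 
to the calculus of subdifferentials introduced in Section 23 of [21], the derivative of the 
objective function with respect to $\rho$ leads to an optimality condition, 

$$0 \in -2 + \frac{1}{m_1}\sum_{i=1}^{m_1} \partial\ell\big(\hat\rho - y_i^{(1)}(\hat f(x_i^{(1)}) + \hat b)\big).$$

The monotonicity and non-negativity of the subdifferential and the bound of $\|\hat f\|_\infty$ lead to 

$$\begin{aligned}
2 &\ge \frac{1}{m_1}\sum_{i=1}^{m_1} \partial\ell\big(\hat\rho - y_i^{(1)}\hat b - K\lambda_{m_1}\big) \\
&= \frac{1}{m_1}\sum_{i=1}^{m_p} \partial\ell\big(\hat\rho - \hat b - K\lambda_{m_1}\big) + \frac{1}{m_1}\sum_{j=1}^{m_n} \partial\ell\big(\hat\rho + \hat b - K\lambda_{m_1}\big) \\
&\ge \frac{1}{m_1}\sum_{i=1}^{m_p} \partial\ell\big(\hat\rho - \hat b - K\lambda_{m_1}\big).
\end{aligned}$$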
The above expression means that there exist numbers in the subdifferential such that the 
inequality holds, where $\sum_{i=1}^{m_p} \partial\ell$ denotes the $m_p$-fold sum of the set 
$\partial\ell$. Let $z_p$ be a real number satisfying $\frac{2 m_1}{m_p} < \partial\ell(z_p)$, i.e., 
all elements in $\partial\ell(z_p)$ are greater than $\frac{2 m_1}{m_p}$. Then, 
$\hat\rho - \hat b - K\lambda_{m_1}$ should be less than $z_p$. In the same way, for $z_n$ satisfying 
$\frac{2 m_1}{m_n} < \partial\ell(z_n)$, we have $\hat\rho + \hat b - K\lambda_{m_1} < z_n$. The 
existence of $z_p$ and $z_n$ is guaranteed by Assumption 3. Hence, the inequalities 

$$\hat\rho < K\lambda_{m_1} + \max\{z_p, z_n\}, \qquad |\hat b| < \frac{\ell(0)}{2} + K\lambda_{m_1} + \max\{z_p, z_n\}$$

hold, in which $\hat\rho \ge -\ell(0)/2$ is used in the second inequality. Define $z$ as a real 
number such that 

$$\max\Big\{\frac{4}{P(Y = +1)}, \frac{4}{P(Y = -1)}\Big\} < \partial\ell(z).$$

Inequalities in (41) lead to 

$$\max\Big\{\frac{2 m_1}{m_p}, \frac{2 m_1}{m_n}\Big\} < \max\Big\{\frac{4}{P(Y = +1)}, \frac{4}{P(Y = -1)}\Big\}.$$

Hence, we can choose $z$ satisfying $\max\{z_p, z_n\} < z$. Suppose that 
$\ell(0)/2 \le K\lambda_{m_1} + z$ holds for $m_1 \ge M$. Then, the inequalities 

$$|\hat\rho| \le 2K\lambda_{m_1} + 2z, \qquad |\hat b| \le 2K\lambda_{m_1} + 2z$$

hold with probability higher than $1 - e^{-c m_1}$ for $m_1 \ge M$. By choosing an appropriate 
positive constant $C > 0$, we obtain (31). 



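B.4 Proof of Lemma 5 

Since $\|f\|_\infty \le K\lambda_{m_1}$ holds for $f \in \mathcal{H}$ such that 
$\|f\|_{\mathcal{H}} \le \lambda_{m_1}$, we have the following inequality 

$$\begin{aligned}
&\sup_{\substack{(x, y) \in \mathcal{X} \times \{+1, -1\} \\ (f, b, \rho) \in \mathcal{G}_{m_1}}} L(x, y; f, b, \rho) - \inf_{\substack{(x, y) \in \mathcal{X} \times \{+1, -1\} \\ (f, b, \rho) \in \mathcal{G}_{m_1}}} L(x, y; f, b, \rho) \\
&\quad\le 2C\lambda_{m_1} + \sup_{(f, b, \rho) \in \mathcal{G}_{m_1}} \ell(\rho - y(f(x) + b)) - (-2C\lambda_{m_1}) \\
&\quad\le 4C\lambda_{m_1} + \ell(C\lambda_{m_1} + K\lambda_{m_1} + C\lambda_{m_1}) \\
&\quad= b_{m_1}.
\end{aligned}$$

In the same way as the proof of Lemma 3.4 in [26], Hoeffding's inequality leads to the upper 
bound (35). Eq. (36) is the direct conclusion of (33) and (??). 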

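C Proof of Theorem 2 

Lemma 3 assures that, for any $\gamma > 0$, there exists sufficiently large $M_1$ such that 

$$\big|\inf\{\mathcal{R}_{\lambda_{m_1}}(f + b, \rho) : f \in \mathcal{H},\ b, \rho \in \mathbb{R}\} - \mathcal{R}^*\big| < \gamma$$

holds for all $m_1 \ge M_1$. Thus, there exist $f_\gamma$, $b_\gamma$ and $\rho_\gamma$ such that 

$$\big|\mathcal{R}_{\lambda_{m_1}}(f_\gamma + b_\gamma, \rho_\gamma) - \mathcal{R}^*\big| < 2\gamma$$

and $\|f_\gamma\|_{\mathcal{H}} \le \lambda_{m_1}$ hold for $m_1 \ge M_1$. Due to the law of large 
numbers, the inequality 

$$\big|\widehat{\mathcal{R}}_{T_1}(f_\gamma + b_\gamma, \rho_\gamma) - \mathcal{R}(f_\gamma + b_\gamma, \rho_\gamma)\big| < \gamma$$

holds with high probability, say $1 - \delta_{m_1}$, for $m_1 \ge M_2$. The boundedness property in 
Lemma 4 leads to the bounds (31) on $(\hat f, \hat b, \hat\rho)$ with probability higher than 
$1 - e^{-c m_1}$ for $m_1 \ge M_3$. In addition, by the uniform bound shown in Lemma 5, the inequality 

$$\sup_{(f, b, \rho) \in \mathcal{G}_{m_1}} \big|\widehat{\mathcal{R}}_{T_1}(f + b, \rho) - \mathcal{R}(f + b, \rho)\big| < \gamma$$

holds with probability $1 - \delta'_{m_1}$. Hence, the probability such that the inequality 

$$\big|\widehat{\mathcal{R}}_{T_1}(\hat f + \hat b, \hat\rho) - \mathcal{R}(\hat f + \hat b, \hat\rho)\big| < \gamma$$

holds is higher than $1 - e^{-c m_1} - \delta'_{m_1}$ for $m_1 \ge M_3$. Let $M_0$ be 
$M_0 = \max\{M_1, M_2, M_3\}$. Then, for any $\gamma > 0$, the following inequalities hold with 
probability higher than $1 - e^{-c m_1} - \delta'_{m_1} - \delta_{m_1}$ for $m_1 \ge M_0$, 

$$\begin{aligned}
\mathcal{R}(\hat f + \hat b, \hat\rho) &\le \widehat{\mathcal{R}}_{T_1}(\hat f + \hat b, \hat\rho) + \gamma \\
&\le \widehat{\mathcal{R}}_{T_1}(f_\gamma + b_\gamma, \rho_\gamma) + \gamma \qquad (43) \\
&\le \mathcal{R}(f_\gamma + b_\gamma, \rho_\gamma) + 2\gamma \\
&\le \mathcal{R}^* + 4\gamma.
\end{aligned}$$

The second inequality (43) above is given as 

$$\widehat{\mathcal{R}}_{T_1}(\hat f + \hat b, \hat\rho) = \widehat{\mathcal{R}}_{T_1, \lambda_{m_1}}(\hat f + \hat b, \hat\rho) \le \widehat{\mathcal{R}}_{T_1, \lambda_{m_1}}(f_\gamma + b_\gamma, \rho_\gamma) = \widehat{\mathcal{R}}_{T_1}(f_\gamma + b_\gamma, \rho_\gamma).$$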

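D Proof of Theorem 3 

For a fixed $\rho$ such that $\rho \ge -\ell(0)/2$, the loss function $\ell(\rho - z)$ is 
classification-calibrated since $\ell'(\rho) > 0$ holds. Hence $\psi(\theta, \rho)$ in Assumption 4 
satisfies $\psi(0, \rho) = 0$, $\psi(\theta, \rho) > 0$ for $0 < \theta \le 1$, and 
$\psi(\theta, \rho)$ is continuous and strictly increasing in $\theta \in [0, 1]$. In addition, for 
all $f \in \mathcal{H}$ and $b \in \mathbb{R}$, the inequality 

$$\psi(\mathcal{E}(f + b) - \mathcal{E}^*, \rho) \le \mathbb{E}[\ell(\rho - y(f(x) + b))] - \inf_{f \in \mathcal{H},\, b \in \mathbb{R}} \mathbb{E}[\ell(\rho - y(f(x) + b))]$$

holds. Details are presented in Theorem 1 and Theorem 2 of [?]. Here we used the equality 

$$\inf\{\mathbb{E}[\ell(\rho - y(f(x) + b))] : f \in \mathcal{H},\ b \in \mathbb{R}\} = \inf\{\mathbb{E}[\ell(\rho - y(f(x) + b))] : f \in L_0,\ b \in \mathbb{R}\},$$

which is shown in Corollary 5.29 of [27]. Hence, we have 

$$\begin{aligned}
\psi(\mathcal{E}(\hat f + \hat b) - \mathcal{E}^*, \hat\rho) &\le \mathbb{E}[\ell(\hat\rho - y(\hat f(x) + \hat b))] - \inf_{f \in \mathcal{H},\, b \in \mathbb{R}} \mathbb{E}[\ell(\hat\rho - y(f(x) + b))] \\
&= \mathcal{R}(\hat f + \hat b, \hat\rho) - \inf_{f \in \mathcal{H},\, b \in \mathbb{R}} \mathcal{R}(f + b, \hat\rho),
\end{aligned}$$

since $\hat\rho \ge -\ell(0)/2$ holds due to (42). We assumed that 
$\mathcal{R}(\hat f + \hat b, \hat\rho)$ converges to $\mathcal{R}^*$ in probability. Then, for any 
$\varepsilon > 0$, the inequality 

$$\mathcal{R}^* \le \inf_{f \in \mathcal{H},\, b \in \mathbb{R}} \mathcal{R}(f + b, \hat\rho) \le \mathcal{R}(\hat f + \hat b, \hat\rho) \le \mathcal{R}^* + \varepsilon$$

holds with high probability for sufficiently large $m_1$. Thus, 
$\psi(\mathcal{E}(\hat f + \hat b) - \mathcal{E}^*, \hat\rho)$ converges to zero in probability. The 
inequality 

$$0 \le \bar\psi(\mathcal{E}(\hat f + \hat b) - \mathcal{E}^*) \le \psi(\mathcal{E}(\hat f + \hat b) - \mathcal{E}^*, \hat\rho)$$

and the assumption on the function $\bar\psi$ ensure that $\mathcal{E}(\hat f + \hat b)$ converges 
to $\mathcal{E}^*$ in probability when $m_1$ tends to infinity. As a result, for any $\gamma > 0$, 

$$|\mathcal{E}(\hat f + \hat b) - \mathcal{E}^*| < \gamma \tag{44}$$

holds with probability higher than $1 - \delta_{m_1, \gamma}$ with respect to the probability 
distribution of $T_1$, where $\delta_{m_1, \gamma}$ satisfies 
$\lim_{m_1 \to \infty} \delta_{m_1, \gamma} = 0$ for any $\gamma > 0$. 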

Next, we study the relation between $\hat f + \hat b$ and $\hat f + \tilde b$. The sample size 
of $T_2$ is $m_2$. For any fixed $f \in \mathcal{H}$, we define the set of 0-1 valued functions, 
$S_f = \{[f(x) + b > 0] : b \in \mathbb{R}\}$. The VC-dimension of $S_f$ equals one^1. Indeed, for 
two distinct points $x, x' \in \mathcal{X}$ such that $f(x) \ge f(x')$, the event such that 
$[f(x) + b > 0] = 0$ and $[f(x') + b > 0] = 1$ is impossible. Hence, for any $\gamma > 0$ and any 
$f \in \mathcal{H}$, the inequality 

$$\sup_{b \in \mathbb{R}} \big|\widehat{\mathcal{E}}_{T_2}(f + b) - \mathcal{E}(f + b)\big| < \gamma \tag{45}$$

holds with probability higher than $1 - \delta'_{m_2, \gamma}$ with respect to the joint probability 
of the training sample $T_2$. Note that $\delta'_{m_2, \gamma}$ depends only on $m_2$, $\gamma$ and 
the VC-dimension of $S_f$. Thus, $\delta'_{m_2, \gamma}$ is independent of the choice of 
$f \in \mathcal{H}$. Remember that $\hat f + \hat b$ depends only on the data set $T_1$. Due to 
the law of large numbers, the inequality 

$$\big|\widehat{\mathcal{E}}_{T_2}(\hat f + \hat b) - \mathcal{E}(\hat f + \hat b)\big| < \gamma$$

holds with probability higher than $1 - \delta''_{m_2, \gamma}$ with respect to the probability 
distribution of $T_2$ conditioned on $T_1$. Since the 0-1 loss is bounded, it is possible to choose 
$\delta''_{m_2, \gamma}$ independent of $\hat f + \hat b$. From the uniform convergence property 
(45), the following inequality also holds 

$$\big|\widehat{\mathcal{E}}_{T_2}(\hat f + \tilde b) - \mathcal{E}(\hat f + \tilde b)\big| < \gamma$$

with probability higher than $1 - \delta'_{m_2, \gamma}$ with respect to the probability 
distribution of $T_2$ conditioned on the observation of $T_1$. In addition, we have 

$$\widehat{\mathcal{E}}_{T_2}(\hat f + \tilde b) \le \widehat{\mathcal{E}}_{T_2}(\hat f + \hat b).$$

^1 See [29] for the definition of the VC dimension. 
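
The inequality between the two empirical errors reflects that the bias term is re-estimated on the second sample. A minimal sketch of such a step is given below, assuming that the re-estimated bias minimises the empirical 0-1 error over the second sample for a fixed function part; the function names, scores and labels are hypothetical.

```python
def zero_one_error(scores, labels, bias):
    """Empirical 0-1 error of the rule sign(score + bias) with labels in {+1, -1}."""
    mistakes = sum(1 for s, y in zip(scores, labels) if (1 if s + bias > 0 else -1) != y)
    return mistakes / len(scores)


def best_bias(scores, labels, initial_bias):
    """Scan one candidate bias per gap between sorted scores, plus the initial bias."""
    ordered = sorted(set(scores))
    thresholds = [ordered[0] - 1.0]
    thresholds += [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    thresholds += [ordered[-1] + 1.0]
    candidates = [initial_bias] + [-t for t in thresholds]
    return min(candidates, key=lambda b: zero_one_error(scores, labels, b))


# Hypothetical held-out scores f(x) and labels standing in for the second sample
scores = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
labels = [-1, -1, -1, 1, 1, 1]
b_hat = 0.5  # assumed bias obtained from the first sample
b_tilde = best_bias(scores, labels, b_hat)
print(zero_one_error(scores, labels, b_hat), zero_one_error(scores, labels, b_tilde))
```

Because the initial bias is among the candidates, the error of the selected bias on the second sample can never exceed that of the initial one.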
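Given the training samples $T_1$ satisfying (44), the inequalities 

$$\mathcal{E}(\hat f + \tilde b) \le \widehat{\mathcal{E}}_{T_2}(\hat f + \tilde b) + \gamma \le \widehat{\mathcal{E}}_{T_2}(\hat f + \hat b) + \gamma \le \mathcal{E}(\hat f + \hat b) + 2\gamma \le \mathcal{E}^* + 3\gamma$$

hold with probability higher than $1 - \delta'_{m_2, \gamma} - \delta''_{m_2, \gamma}$ with respect 
to the probability distribution of $T_2$ conditioned on the observation of $T_1$. Hence, as for the 
conditional probability, we have 

$$P(\{T_2 : \mathcal{E}(\hat f + \tilde b) \le \mathcal{E}^* + 3\gamma\} \mid T_1) \ge 1 - \delta'_{m_2, \gamma} - \delta''_{m_2, \gamma}.$$

Remember that $\delta'_{m_2, \gamma}$ and $\delta''_{m_2, \gamma}$ do not depend on $T_1$. Hence, as 
for the joint probability of $T_1$ and $T_2$, we have 

$$P(\{T_1, T_2 : \mathcal{E}(\hat f + \tilde b) \le \mathcal{E}^* + 3\gamma\}) \ge (1 - \delta'_{m_2, \gamma} - \delta''_{m_2, \gamma})(1 - \delta_{m_1, \gamma}).$$

The above inequality implies that $\mathcal{E}(\hat f + \tilde b)$ converges to $\mathcal{E}^*$ in 
probability when $m_1$ and $m_2$ tend to infinity. 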
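
The VC-dimension claim for the class $S_f$ used in this proof can be checked by brute force on small point sets. The sketch below enumerates the 0-1 patterns produced by varying the bias for fixed scores $f(x)$; the helper names and the score values are hypothetical.

```python
from itertools import product


def achievable_labelings(scores):
    """All patterns ([s + b > 0] for each score s) obtained as the bias b ranges over the reals."""
    ordered = sorted(set(scores))
    # One threshold t = -b per region of the real line, plus the score values themselves.
    thresholds = [ordered[0] - 1.0, ordered[-1] + 1.0]
    thresholds += [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    thresholds += ordered
    return {tuple(int(s - t > 0) for s in scores) for t in thresholds}


def is_shattered(scores):
    """True if every 0-1 labeling of the points is achievable by some bias."""
    return achievable_labelings(scores) == set(product((0, 1), repeat=len(scores)))


print(is_shattered([0.7]))        # a single point
print(is_shattered([0.7, -0.2]))  # two points with f(x) > f(x')
```

A single point can receive both labels, whereas for two points the pattern that assigns 0 to the larger score and 1 to the smaller one never appears, which is the argument given in the proof.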

E Proofs of Lemma 6 and Lemma 7 

E.1 Proof of Lemma 6 

For $\theta = 0$ and $\theta = 1$, we can directly confirm that the lemma holds. In the following, 
we assume $0 < \theta < 1$ and $\rho \ge -\ell(0)/2$. We consider the following optimization 
problem involved in $\psi(\theta, \rho)$, 

$$\inf_{z \in \mathbb{R}}\ \frac{1 + \theta}{2}\,\ell(\rho - z) + \frac{1 - \theta}{2}\,\ell(\rho + z). \tag{46}$$

The objective function is a finite-valued convex function on $\mathbb{R}$, and diverges to infinity 
when $z$ tends to $\pm\infty$. Hence, there exists an optimal solution. Let $z^* \in \mathbb{R}$ be 
an optimal solution of (46). The optimality condition is given as 

$$(1 + \theta)\,\ell'(\rho - z^*) - (1 - \theta)\,\ell'(\rho + z^*) = 0.$$

We assumed that both $1 + \theta$ and $1 - \theta$ are positive and that 
$\rho \ge -\ell(0)/2 > d$ holds. Hence, neither $\ell'(\rho - z^*)$ nor $\ell'(\rho + z^*)$ can be 
zero. Indeed, if one of them is equal to zero, the other is also zero. In that case we would have 
$\rho - z^* \le d$ and $\rho + z^* \le d$, and these inequalities contradict $\rho > d$. 
Then, we have $\rho - z^* > d$ and $\rho + z^* > d$, i.e., $|z^*| < \rho - d$. In addition, we have 

$$\frac{1 + \theta}{2} = \frac{\ell'(\rho + z^*)}{\ell'(\rho + z^*) + \ell'(\rho - z^*)}.$$

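Since $\ell''(z) > 0$ holds on $(d, \infty)$, the second derivative of the objective in (46) 
satisfies the positivity condition, 

$$(1 + \theta)\,\ell''(\rho - z) + (1 - \theta)\,\ell''(\rho + z) > 0$$

for all $z$ such that $\rho - z > d$ and $\rho + z > d$. Therefore, $z^*$ is uniquely determined. 
For a fixed $\theta \in (0, 1)$, the optimal solution can be described as a function of $\rho$, 
i.e., $z^* = z(\rho)$. By the implicit function theorem, $z(\rho)$ is continuously differentiable 
with respect to $\rho$. Then, the derivative of $\psi(\theta, \rho)$ is given as 

$$\begin{aligned}
\frac{\partial}{\partial\rho}\psi(\theta, \rho)
&= \ell'(\rho) - \frac{1 + \theta}{2}\,\ell'(\rho - z(\rho))\Big(1 - \frac{\partial z}{\partial\rho}\Big) - \frac{1 - \theta}{2}\,\ell'(\rho + z(\rho))\Big(1 + \frac{\partial z}{\partial\rho}\Big) \\
&= \ell'(\rho) - \frac{\ell'(\rho + z(\rho))\,\ell'(\rho - z(\rho))}{\ell'(\rho + z(\rho)) + \ell'(\rho - z(\rho))}\Big(1 - \frac{\partial z}{\partial\rho}\Big) - \frac{\ell'(\rho - z(\rho))\,\ell'(\rho + z(\rho))}{\ell'(\rho + z(\rho)) + \ell'(\rho - z(\rho))}\Big(1 + \frac{\partial z}{\partial\rho}\Big) \\
&= \ell'(\rho) - \frac{2\,\ell'(\rho - z(\rho))\,\ell'(\rho + z(\rho))}{\ell'(\rho + z(\rho)) + \ell'(\rho - z(\rho))}.
\end{aligned}$$

The convexity of $1/\ell'(z)$ for $z > d$ leads to 

$$0 < \frac{1}{\ell'(\rho)} \le \frac{1}{2\ell'(\rho + z(\rho))} + \frac{1}{2\ell'(\rho - z(\rho))} = \frac{\ell'(\rho + z(\rho)) + \ell'(\rho - z(\rho))}{2\,\ell'(\rho - z(\rho))\,\ell'(\rho + z(\rho))}.$$

Hence, we have 

$$\frac{\partial}{\partial\rho}\psi(\theta, \rho) \ge 0$$

for $\rho \ge -\ell(0)/2 > d$ and $0 < \theta < 1$. As a result, we see that $\psi(\theta, \rho)$ is 
non-decreasing as a function of $\rho$. 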
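
As a sanity check of the monotonicity, one may work out a concrete case. The example below uses the exponential loss $\ell(z) = e^z$, which is chosen here only for illustration and is not claimed to satisfy every assumption of the lemma, and it assumes that $\psi(\theta, \rho)$ equals $\ell(\rho)$ minus the infimum in (46), which is the form consistent with the derivative computed above.

```latex
\begin{aligned}
\inf_{z \in \mathbb{R}} \Big\{ \tfrac{1+\theta}{2}\, e^{\rho - z} + \tfrac{1-\theta}{2}\, e^{\rho + z} \Big\}
  &= e^{\rho} \sqrt{1 - \theta^{2}},
  \qquad e^{z^{*}} = \sqrt{\tfrac{1+\theta}{1-\theta}}, \\
\psi(\theta, \rho) &= e^{\rho} \big( 1 - \sqrt{1 - \theta^{2}} \big), \\
\frac{\partial}{\partial \rho} \psi(\theta, \rho)
  &= e^{\rho} - \frac{2\, e^{\rho - z^{*}} e^{\rho + z^{*}}}{e^{\rho + z^{*}} + e^{\rho - z^{*}}}
   = e^{\rho} \big( 1 - \sqrt{1 - \theta^{2}} \big) \ \ge\ 0 .
\end{aligned}
```

In this case $1/\ell'(z) = e^{-z}$ is convex, and the derivative obtained from the general formula agrees with differentiating the closed form of $\psi$ directly.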

E.2 Proof of Lemma 7 

We use the result of [?]. For a fixed $\rho$, the function $\xi(z, \rho)$ is continuous for 
$z \ge 0$, and the convexity of $\ell$ leads to the non-negativity of $\xi(z, \rho)$. Moreover, the 
convexity and the non-negativity of $\ell(z)$ lead to 



z£'{p) z£'{p) - z£'{p) 

for $z > 0$ and $\rho \ge -\ell(0)/2$, where $\ell(\rho)$ and $\ell'(\rho)$ are positive for 
$\rho \ge -\ell(0)/2$. The above inequality and the continuity of $\xi(\cdot, \rho)$ ensure that 
there exists $z$ satisfying $\xi(z, \rho) = \theta$ for all $\theta$ such that $0 \le \theta < 1$. 
We define the inverse function $\xi_\rho^{-1}$ by 

$$\xi_\rho^{-1}(\theta) = \inf\{z \ge 0 : \xi(z, \rho) = \theta\}$$

for $0 \le \theta < 1$. For a fixed $\rho \ge -\ell(0)/2$, the loss function $\ell(\rho - z)$ is 
classification-calibrated. Hence, Lemma 3 in [?] leads to the inequality 

for $0 \le \theta < 1$. Define $\bar\xi^{-1}$ by 

$$\bar\xi^{-1}(\theta) = \inf\{z \ge 0 : \bar\xi(z) = \theta\}.$$

From the definition of $\bar\xi(z)$, $\bar\xi^{-1}(\theta)$ is well-defined for all 
$\theta \in [0, 1)$. Since $\xi(z, \rho) \le \bar\xi(z)$ holds, we have 
$\xi_\rho^{-1}(\theta/2) \ge \bar\xi^{-1}(\theta/2)$. In addition, $\ell'(\rho)$ is non-decreasing 
as a function of $\rho$. Thus, we have 

i^{e,p)>e{-mm\t\^ 

for all $\rho \ge -\ell(0)/2$ and $0 \le \theta \le 1$. Then, we can choose 

It is straightforward to confirm that the conditions of Assumption 5 are satisfied. 